Boosting is one of the most significant advances in machine learning
for classification and regression. In its original and computationally
flexible version, boosting seeks to minimize empirically a loss
function in a greedy fashion. The resulting estimator takes an additive
function form and is built iteratively by applying a base estimator (or
learner) to updated samples depending on the previous iterations. An
unusual regularization technique, early stopping, is employed based
on CV or a test set.

This paper studies numerical convergence, consistency and statis- 
tical rates of convergence of boosting with early stopping, when it is 
carried out over the linear span of a family of basis functions. For 
general loss functions, we prove the convergence of boosting's greedy 
optimization to the infimum of the loss function over the linear
span. Using the numerical convergence result, we find early-stopping 
strategies under which boosting is shown to be consistent based on 
i.i.d. samples, and we obtain bounds on the rates of convergence for 
boosting estimators. Simulation studies are also presented to illus- 
trate the relevance of our theoretical results for providing insights to 
practical aspects of boosting. 

As a side product, these results also reveal the importance of re- 
stricting the greedy search step-sizes, as known in practice through 
the work of Friedman and others. Moreover, our results lead to a 
rigorous proof that for a linearly separable problem, AdaBoost with 
$\varepsilon \to 0$ step-size becomes an $L_1$-margin maximizer when left to run to
convergence. 

1. Introduction. In this paper we consider boosting algorithms for
classification and regression. These algorithms represent one of the major
advances in machine learning. In their original version, the computational
aspect is explicitly specified as part of the estimator/algorithm. That is, the
empirical minimization of an appropriate loss function is carried out in a 
greedy fashion, which means that at each step a basis function that leads 
to the largest reduction of empirical risk is added into the estimator. This 
specification distinguishes boosting from other statistical procedures which 
are defined by an empirical minimization of a loss function without the nu- 
merical optimization details. 

Boosting algorithms construct composite estimators using often simple 
base estimators through the greedy fitting procedure. An unusual regular- 
ization technique, early stopping, is employed based on CV or a test set. This 
family of algorithms has been known as the stagewise fitting of additive mod- 
els in the statistics literature [18, 17]. For the squared loss function, they 
were often referred to in the signal processing community as matching
pursuit [29]. More recently, it was noticed that the AdaBoost method proposed
in the machine learning community [13] can also be regarded as stagewise fit- 
ting of additive models under an exponential loss function [7, 8, 15, 31, 34]. In 
this paper we use the term boosting to indicate a greedy stagewise procedure 
to minimize a certain loss function empirically. The abstract formulation will 
be presented in Section 2. 

Boosting procedures have drawn much attention in the machine learning 
community as well as in the statistics community, due to their superior 
empirical performance for classification problems. In fact, boosted decision 
trees are generally regarded as the best off-the-shelf classification algorithms 
we have today. In spite of the significant practical interest in boosting, a 
number of theoretical issues have not been fully addressed in the literature. 
In this paper we hope to fill some gaps by addressing three basic issues 
regarding boosting: its numerical convergence when the greedy iteration 
increases, in Section 4.1; its consistency (after early stopping) when the 
training sample size gets large, in Sections 3.3 and 5.2; and bounds on the 
rate of convergence for boosting estimators, in Sections 3.3 and 5.3. 

It is now well known that boosting forever can overfit the data (e.g., 
see [16, 19]). Therefore, in order to achieve consistency, it is necessary to 
stop the boosting procedure early (but not too early) to avoid overfitting. 
In the early stopping framework, the consistency of boosting procedures has 
been considered by Jiang for exponential loss [19] boosting (but the
consistency is in terms of the classification loss) and Bühlmann under squared
loss [10] for tree-type base classifiers. Jiang's approach also requires some 
smoothness conditions on the underlying distribution, and it is nonconstruc- 
tive (hence does not lead to an implementable early-stopping strategy). In 
Sections 3.3 and 5.2 we present an early-stopping strategy for general loss 
functions that guarantees consistency.

A different method of achieving consistency (and obtaining rate of
convergence results) is through restricting the weights of the composite estimator
using the 1-norm of its coefficients (with respect to the basis functions). For 
example, this point of view is taken up in [5, 28, 30]. In this framework, 
early stopping is not necessary since the degree of overfitting or regulariza- 
tion is controlled by the 1-norm of the weights of the composite estimator. 
Although this approach simplifies the theoretical analysis, it also introduces 
an additional control quantity which needs to be adjusted based on the data. 
Therefore, in order to select an optimal regularization parameter, one has 
to solve many different optimization problems, each with a regularization 
parameter. Moreover, if there are an infinite (or extremely large) number of 
basis functions, then it is not possible to solve the associated 1-norm regular- 
ization problem. Note that in this case greedy boosting (with approximate 
optimization) can still be applied. 

A question related to consistency and rate of convergence is the conver- 
gence of the boosting procedure as an optimization method. This is clearly 
one of the most fundamental theoretical issues for boosting algorithms. Pre- 
vious studies have focused on special loss functions. Specifically, Mallat and 
Zhang proved the convergence of matching pursuit in [29], which was then 
used in [10] to study consistency; in [9] Breiman obtained an infinite-sample 
convergence result of boosting with the exponential loss function for $\pm 1$-trees
(under some smoothness assumptions on the underlying distribution), and 
the result was used by Jiang to study the consistency of AdaBoost. In [12] 
a Bregman divergence-based analysis was given. A convergence result was 
also obtained in [31] for a gradient descent version of boosting. 

None of these studies provides any information on the numerical speed of 
convergence for the optimization. The question of numerical speed of conver- 
gence has been studied when one works with the 1-norm regularized version 
of boosting where we assume that the optimization is performed in the con- 
vex hull of the basis functions. Specifically, for function estimation under 
least-squares loss, the convergence of the greedy algorithm in the convex 
hull was studied in [1, 20, 25]. For general loss functions, the convergence of 
greedy algorithms (again, the optimization is restricted to the convex hull) 
was recently studied in [37]. In this paper we apply the same underlying idea
to the standard boosting procedure where we do not limit the optimization 
to the convex hull of the basis functions. The resulting bound provides in- 
formation on the speed of convergence for the optimization. An interesting 
observation of our analysis is the important role of small step-size in the 
convergence of boosting procedures. This provides some theoretical justifi- 
cation for Friedman's empirical observation [14] that using small step-sizes 
almost always helps in boosting procedures. 

Moreover, the combination of numerical convergence results with modern
empirical process bounds (based on Rademacher complexity) provides
a way to derive bounds on the convergence rates of early-stopping boosting
procedures. These results can be found in Sections 3.3 and 5.3. Section 6
contains a simulation study to show the usefulness of the insights from our
theoretical analyses in practical implementations of boosting. The proofs of
the two main results in the numerical convergence section (Section 4.1) are
deferred to Section A.2. Section A.3 discusses relaxations of the restricted
step-size condition used for earlier results, and Section A.4 uses numerical
convergence results to give a rigorous proof of the fact that for separable
problems, AdaBoost with small step-size becomes an $L_1$-margin maximizer
at its limit (see [18]).

2. Abstract boosting procedure. We now describe the basics to define 
the boosting procedure that we will analyze in this paper. A similar setup can 
be found in [31]. The main difference is that the authors in [31] use a gradient 
descent rule in their boosting procedure while here we use approximate 
minimization. 

Let $S$ be a set of real-valued functions and define

$$\mathrm{span}(S) = \Bigl\{ \sum_{j=1}^{m} w^j f_j : f_j \in S,\ w^j \in \mathbb{R},\ m \in \mathbb{Z}^+ \Bigr\},$$

which forms a linear function space. For all $f \in \mathrm{span}(S)$, we can define the
1-norm with respect to the basis $S$ as

$$\|f\|_1 = \inf\Bigl\{ \|w\|_1 : f = \sum_{j=1}^{m} w^j f_j;\ f_j \in S,\ m \in \mathbb{Z}^+ \Bigr\}. \tag{1}$$

We want to find a function $\bar f \in \mathrm{span}(S)$ that approximately solves the
optimization problem

$$\inf_{f \in \mathrm{span}(S)} A(f), \tag{2}$$

where $A$ is a convex function of $f$ defined on $\mathrm{span}(S)$. Note that the optimal
value may not be achieved by any $f \in \mathrm{span}(S)$, and for certain formulations
(such as AdaBoost) it is possible that the optimal value is not finite. Both
cases are still covered by our results, however.

The abstract form of the greedy boosting procedure (with restricted step- 
size) considered in this paper is given by the following algorithm: 

Algorithm 2.1 (Greedy boosting).

  Pick $f_0 \in \mathrm{span}(S)$
  for $k = 0, 1, 2, \ldots$
    Select a closed subset $\Lambda_k \subset \mathbb{R}$ such that $0 \in \Lambda_k$ and $\Lambda_k = -\Lambda_k$
    Find $\bar\alpha_k \in \Lambda_k$ and $\bar g_k \in S$ to approximately minimize the function:
      $(*)\quad (\alpha_k, g_k) \to A(f_k + \alpha_k g_k)$
    Let $f_{k+1} = f_k + \bar\alpha_k \bar g_k$
  end
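
As an illustration only, the loop of Algorithm 2.1 can be sketched in Python for the empirical logistic risk over a finite dictionary. The sketch assumes $\psi(u) = u$, $f_0 = 0$, basis functions stored as value vectors on the sample, and a finite symmetric grid standing in for $\Lambda_k$ that is searched exhaustively. The names `greedy_boost`, `basis_values`, `step_bounds` and the toy data are hypothetical and do not come from the paper.

```python
import math

def logistic_risk(scores, labels):
    """Empirical risk: average of ln(1 + exp(-f(x_i) * y_i))."""
    return sum(math.log1p(math.exp(-s * y)) for s, y in zip(scores, labels)) / len(labels)

def greedy_boost(basis_values, labels, step_bounds, grid=5):
    """Greedy boosting with restricted step-size (illustrative sketch).

    basis_values[j][i] holds g_j(x_i); step_bounds[k] plays the role of h_k = sup Lambda_k.
    Lambda_k is taken to be a finite symmetric grid in [-h_k, h_k] that contains 0.
    """
    scores = [0.0] * len(labels)             # f_0 = 0 evaluated on the sample
    coef = [0.0] * len(basis_values)         # weights of f_k on the dictionary
    history = [logistic_risk(scores, labels)]
    for h in step_bounds:
        candidates = [h * t / grid for t in range(-grid, grid + 1)]
        best_risk, best_j, best_a = history[-1], 0, 0.0   # alpha = 0 is always allowed
        for j, g in enumerate(basis_values):
            for a in candidates:
                risk = logistic_risk([s + a * gi for s, gi in zip(scores, g)], labels)
                if risk < best_risk:
                    best_risk, best_j, best_a = risk, j, a
        scores = [s + best_a * gi for s, gi in zip(scores, basis_values[best_j])]
        coef[best_j] += best_a
        history.append(best_risk)
    return coef, history

# Toy data (made up): six labelled points and two basis functions with |g| <= 1.
labels = [1, 1, -1, -1, 1, -1]
basis_values = [[1.0, 0.5, -1.0, -0.5, -0.2, 0.3],
                [0.2, -1.0, 0.4, -0.6, 1.0, -1.0]]
step_bounds = [0.5 / (k + 1) ** 0.75 for k in range(50)]
coef, history = greedy_boost(basis_values, labels, step_bounds)
print(history[0], history[-1], sum(abs(c) for c in coef))
```

Because $\alpha = 0$ is always a candidate, the recorded risk never increases, and the sum of absolute coefficients (an upper bound on the 1-norm of the fit) cannot exceed the accumulated step-size bounds.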

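Remark 2.1. The approximate minimization of $(*)$ in Algorithm 2.1
should be interpreted as finding $\bar\alpha_k \in \Lambda_k$ and $\bar g_k \in S$ such that

$$A(f_k + \bar\alpha_k \bar g_k) \le \inf_{\alpha_k \in \Lambda_k,\, g_k \in S} A(f_k + \alpha_k g_k) + \varepsilon_k, \tag{3}$$

where $\varepsilon_k \ge 0$ is a sequence of nonnegative numbers that converges to zero.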

Remark 2.2. The requirement that $0 \in \Lambda_k$ is not crucial in our analysis.
It is used as a convenient assumption in the proof of Lemma 4.1 to simplify
the conditions. Our convergence analysis allows the choice of $\Lambda_k$ to depend
on the previous steps of the algorithm. However, the most interesting $\Lambda_k$
for the purpose of this paper will be independent of previous steps of the
algorithm:

(a) $\Lambda_k = \mathbb{R}$,

(b) $\sup \Lambda_k = h_k$ where $h_k \ge 0$ and $h_k \to 0$.

As we will see later, the restriction of $\alpha_k$ to the subset $\Lambda_k \subset \mathbb{R}$ is useful in
the convergence analysis.

As we shall see later, the step-size $\bar\alpha_k$ plays an important role in our
analysis. A particularly interesting case is to restrict the step-size explicitly.
That is, we assume that the starting point $f_0$, as well as quantities $\varepsilon_k$ and
$\Lambda_k$ in (3), are sample-independent, and $h_k = \sup \Lambda_k$ satisfies the conditions

$$\sum_{j=0}^{\infty} h_j = \infty, \qquad \sum_{j=0}^{\infty} h_j^2 < \infty. \tag{4}$$

The reason for this condition will become clear in the numerical convergence
analysis of Section 4.1.
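
To make the two requirements on the step-size bounds concrete, the following sketch compares partial sums for two hypothetical schedules: $h_j = 1/(j+1)$, whose sum diverges while its sum of squares stays below $\pi^2/6$, and a fixed constant $h_j = 0.1$, for which both sums grow without bound. The schedules and function names are illustrative choices, not ones prescribed by the paper.

```python
import math

def partial_sums(h, n):
    """Return (sum of h_j, sum of h_j^2) over j = 0, ..., n-1."""
    values = [h(j) for j in range(n)]
    return sum(values), sum(v * v for v in values)

def harmonic(j):
    return 1.0 / (j + 1)           # sum diverges, sum of squares converges

def constant(j):
    return 0.1                     # both sums diverge for a fixed constant

for n in (10, 1000, 100000):
    print(n, partial_sums(harmonic, n), partial_sums(constant, n))
```

A fixed constant step-size therefore fails the second requirement over an infinite horizon, which is why constant step-size results are stated for a finite number of steps with a step-size that shrinks with the sample size.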

3. Assumptions and main statistical results. The purpose of this section 
is to state assumptions needed for the analyses to follow, as well as the 
main statistical results. There are two main aspects of our analysis. The 
first is the numerical convergence of the boosting algorithm as the number 
of iterations increases, and the second is the statistical convergence of the 
resulting boosting estimator, so as to avoid overfitting. We list respective 
assumptions separately. The statistical consistency result can be obtained 
by combining these two aspects.

3.1. Assumptions for the numerical convergence analysis. For all $f \in \mathrm{span}(S)$
and $g \in S$, we define a real-valued function $A_{f,g}(\cdot)$ as

$$A_{f,g}(h) = A(f + hg).$$



Definition 3.1. Let $A(f)$ be a function of $f$ defined on $\mathrm{span}(S)$. Denote
by $\mathrm{span}(S)'$ the dual space of $\mathrm{span}(S)$ [i.e., the space of real-valued linear
functionals on $\mathrm{span}(S)$]. We say that $A$ is differentiable with gradient $\nabla A \in
\mathrm{span}(S)'$ if it satisfies the following Fréchet-like differentiability condition
for all $f, g \in \mathrm{span}(S)$:

$$\lim_{h \to 0} \frac{1}{h}\bigl(A(f + hg) - A(f)\bigr) = \nabla A(f)^T g,$$

where $\nabla A(f)^T g$ denotes the value of the linear functional $\nabla A(f)$ at $g$. Note
that we adopt the notation $f^T g$ from linear algebra, where it is just the
scalar product of the two vectors.

For reference, we shall state the following assumption, which is required 
in our analysis. 

Assumption 3.1. Let $A(f)$ be a convex function of $f$ defined on $\mathrm{span}(S)$,
which satisfies the following conditions:

1. The functional $A$ is differentiable with gradient $\nabla A$.

2. For all $f \in \mathrm{span}(S)$ and $g \in S$, the real-valued function $A_{f,g}$ is second-order
differentiable (as a function of $h$) and the second derivative satisfies

$$A''_{f,g}(0) \le M(\|f\|_1), \tag{5}$$

where $M(\cdot)$ is a nondecreasing real-valued function.
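
As a numerical sanity check of the curvature condition, consider the empirical logistic objective with $\psi(u) = u$ and basis functions bounded by 1 in absolute value. Since the second derivative of $\ln(1 + e^{-v})$ never exceeds 1/4, the curvature of $h \mapsto A(f + hg)$ at 0 is at most 1/4, so a constant $M$ is admissible in this case (1/4 is one valid choice, not necessarily the constant used in the paper). The sketch below estimates the curvature by central finite differences on randomly generated, purely illustrative inputs.

```python
import math
import random

def logistic_objective(scores, labels):
    """A(f) with psi(u) = u: empirical mean of ln(1 + exp(-f(x_i) * y_i))."""
    return sum(math.log1p(math.exp(-s * y)) for s, y in zip(scores, labels)) / len(labels)

def curvature_at_zero(f_vals, g_vals, labels, h=1e-3):
    """Central finite-difference estimate of the second derivative of t -> A(f + t g) at 0."""
    def shifted(t):
        return logistic_objective([f + t * g for f, g in zip(f_vals, g_vals)], labels)
    return (shifted(h) - 2.0 * shifted(0.0) + shifted(-h)) / (h * h)

rng = random.Random(0)
m = 200
labels = [rng.choice((-1, 1)) for _ in range(m)]
curvatures = []
for _ in range(20):
    f_vals = [rng.uniform(-5.0, 5.0) for _ in range(m)]   # arbitrary current fit
    g_vals = [rng.uniform(-1.0, 1.0) for _ in range(m)]   # basis function with |g| <= 1
    curvatures.append(curvature_at_zero(f_vals, g_vals, labels))
print(max(curvatures))
```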

Remark 3.1. A more general form of (5) is $A''_{f,g}(0) \le \ell(g) M(\|f\|_1)$,
where $\ell(g)$ is an appropriate scaling factor of $g$. For example, in the examples
given below, $\ell(g)$ can be measured by $\sup_x |g(x)|$ or $E_X g(X)^2$. In (5) we
assume that functions in $S$ are properly scaled so that $\ell(g) \le 1$. This is for
notational convenience only. With more complicated notation, techniques
developed in this paper can also handle the general case directly without
any normalization assumption of the basis functions.

The function $M(\cdot)$ will appear in the convergence analysis in Section 4.1.
Although our analysis can handle unbounded $M(\cdot)$, the most interesting
boosting examples have bounded $M(\cdot)$ (as we will show shortly). In this
case we will also use $M$ to denote a real-valued upper bound of $\sup_a M(a)$.

For statistical estimation problems such as classification and regression
with a covariate or predictor variable $X$ and a real response variable $Y$
having a joint distribution, we are interested in the following form of $A(f)$
in (2):

$$A(f) = \psi\bigl(E_{X,Y}\, \phi(f(X), Y)\bigr), \tag{6}$$

where $\phi(\cdot,\cdot)$ is a loss function that is convex in its first argument and $\psi$ is a
monotonic increasing auxiliary function which is introduced so that $A(f)$ is
convex and $M(\cdot)$ behaves nicely (e.g., bounded). We note that the introduction
of $\psi$ is for proving numerical convergence results using our proof
techniques, which are needed for proving statistical consistency of boosting
with early stopping. However, $\psi$ is not necessary for the actual implementation
of the boosting procedure. Clearly the minimizer of (6) that solves (2)
does not depend on the choice of $\psi$. Moreover, the behavior of Algorithm 2.1
is not affected by the choice of $\psi$ as long as $\varepsilon_k$ in (3) is appropriately
redefined. We may thus always take $\psi(u) = u$, but choosing other auxiliary
functions can be convenient for certain problems in our analysis since the
resulting formulation has a bounded $M(\cdot)$ function (see the examples given
below). We have also used $E_{X,Y}$ to indicate the expectation with respect to
the joint distribution of $(X, Y)$.

When not explicitly specified, $E_{X,Y}$ can denote the expectation either
with respect to the underlying population or with respect to the empirical
samples. This makes no difference as far as our convergence analysis in
Section 4.1 is concerned. When it is necessary to distinguish an empirical
quantity from its population counterpart, we shall denote the former by a hat
above the corresponding quantity. For example, $\hat E$ denotes the expectation
with respect to the empirical samples, and $\hat A$ is the function in (6) with $E_{X,Y}$
replaced by $\hat E_{X,Y}$. This distinction will become necessary in the uniform
convergence analysis of Section 4.2. 

An important application of boosting is binary classification. In this case 
it is very natural for us to use a set of basis functions that satisfy the 
conditions 

$$\sup_{g \in S,\, x} |g(x)| \le 1, \qquad y = \pm 1. \tag{7}$$

For certain loss functions (such as least squares) this condition can be relaxed.
In the classification literature $\phi(f, y)$ usually has a form $\phi(fy)$.

Commonly used loss functions are listed in Section A.1. They show that
for a typical boosting loss function $\phi$, there exists a constant $M$ such that
$\sup_a M(a) \le M$.

3.2. Assumptions for the statistical convergence analysis. In classification
or regression problems with a covariate or predictor variable $X$ on $\mathbb{R}^d$
and a real response variable $Y$, we observe $m$ i.i.d. samples $Z_1^m = \{(X_1, Y_1), \ldots, (X_m, Y_m)\}$
from an unknown underlying distribution $D$. Consider a loss function $\phi(f, y)$
and define $Q(f)$ (true risk) and $\hat Q(f)$ (empirical risk) as

$$Q(f) = E_D\, \phi(f(X), Y), \qquad \hat Q(f) = \hat E\, \phi(f(X), Y) = \frac{1}{m} \sum_{i=1}^{m} \phi(f(X_i), Y_i), \tag{8}$$

where $E_D$ is the expectation over the unknown true joint distribution $D$
of $(X, Y)$ (denoted by $E_{X,Y}$ previously); $\hat E$ is the empirical expectation
based on the sample $Z_1^m$.

Boosting estimators are constructed by applying Algorithm 2.1 with respect
to the empirical expectation $\hat E$ with a set $S$ of real-valued basis functions
$g(x)$. We use $\hat A(f)$ to denote the empirical objective function,

$$\hat A(f) = \psi(\hat Q(f)) = \psi\bigl(\hat E\, \phi(f(X), Y)\bigr).$$

Similarly, quantities $f_k$, $\alpha_k$ and $g_k$ in Algorithm 2.1 will be replaced by $\hat f_k$,
$\hat\alpha_k$ and $\hat g_k$, respectively.

Techniques from modern empirical process theory can be used to analyze 
the statistical convergence of a boosting estimator with a finite sample. In 
particular, we use the concept of Rademacher complexity, which is given by 
the following definition. 

Definition 3.2. Let $G = \{g(x, y)\}$ be a set of functions of input $(x, y)$.
Let $\{\sigma_i\}_{i=1}^m$ be a sequence of binary random variables such that $\sigma_i = \pm 1$ with
probability 1/2. The (one-sided) sample-dependent Rademacher complexity
of $G$ is given by

$$R_m(G, Z_1^m) = E_\sigma \sup_{g \in G} \frac{1}{m} \sum_{i=1}^{m} \sigma_i g(X_i, Y_i),$$

and the expected Rademacher complexity of $G$ is denoted by

$$R_m(G) = E_{Z_1^m} R_m(G, Z_1^m).$$
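
For a finite class, the sample-dependent Rademacher complexity can be approximated by Monte Carlo: draw random sign vectors, take the supremum of the signed average over the class, and average over draws. The sketch below does this for a made-up class of 10 functions with values in $\{-1, +1\}$ on 100 points and compares the estimate with the standard finite-class (Massart) bound $\sqrt{2 \ln N / m}$. The function name `empirical_rademacher` and the data are hypothetical.

```python
import math
import random

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_g (1/m) sum_i sigma_i g(X_i, Y_i).

    values[g][i] stores g evaluated at the i-th sample point.
    """
    rng = random.Random(seed)
    m = len(values[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in values) / m
    return total / n_draws

# Made-up finite class: 10 functions taking values in {-1, +1} on 100 sample points.
data_rng = random.Random(1)
m, n_funcs = 100, 10
values = [[data_rng.choice((-1.0, 1.0)) for _ in range(m)] for _ in range(n_funcs)]
estimate = empirical_rademacher(values)
massart = math.sqrt(2.0 * math.log(n_funcs) / m)   # finite-class (Massart) upper bound
print(estimate, massart)
```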

The Rademacher complexity approach for analyzing boosting algorithms 
first appeared in [21], and it has been used by various people to analyze 
learning problems, including boosting; for example, see [3, 2, 4, 6, 30]. The 
analysis using Rademacher complexity as defined above can be applied both 
to regression and to classification. However, for notational simplicity we 
focus only on boosting methods for classification, where we impose the fol- 
lowing assumption. This assumption is not essential to our analysis, but it 
simplifies the calculations and some of the final conditions.

Assumption 3.2. We consider the following form of $\phi$ in (8): $\phi(f, y) =
\phi(fy)$ with a convex function $\phi(a) : \mathbb{R} \to \mathbb{R}$ such that $\phi(-a) > \phi(a)$ for all
$a > 0$. Moreover, we assume that

(i) Condition (7) holds.

(ii) $S$ in Algorithm 2.1 is closed under negation (i.e., $f \in S \to -f \in S$).

(iii) There exists a finite Lipschitz constant $\gamma_\phi(\beta)$ of $\phi$ in $[-\beta, \beta]$:

$$\forall\, |f_1|, |f_2| \le \beta: \quad |\phi(f_1) - \phi(f_2)| \le \gamma_\phi(\beta)\, |f_1 - f_2|.$$

The Lipschitz condition of a loss function is usually easy to estimate. For
reference, we list $\gamma_\phi$ for loss functions considered in Section A.1:

(a) Logistic regression $\phi(f) = \ln(1 + \exp(-f))$: $\gamma_\phi(\beta) \le 1$.

(b) Exponential $\phi(f) = \exp(-f)$: $\gamma_\phi(\beta) \le \exp(\beta)$.

(c) Least squares $\phi(f) = (f - 1)^2$: $\gamma_\phi(\beta) \le 2(\beta + 1)$.

(d) Modified least squares $\phi(f) = \max(1 - f, 0)^2$: $\gamma_\phi(\beta) \le 2(\beta + 1)$.

(e) p-norm $\phi(f) = |f - 1|^p$ ($p \ge 2$): $\gamma_\phi(\beta) \le p(\beta + 1)^{p-1}$.
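
The listed Lipschitz bounds can be checked numerically. By the mean value theorem, every chord slope of $\phi$ on $[-\beta, \beta]$ is bounded by the supremum of $|\phi'|$ there, so the largest chord slope on a fine grid should never exceed the stated bound. The sketch below performs this check for a few values of $\beta$; the choice $p = 3$ for the p-norm loss and the grid size are arbitrary illustrative assumptions.

```python
import math

# (loss, claimed Lipschitz bound gamma(beta)) pairs; p = 3 is an arbitrary choice for the p-norm.
LOSSES = {
    "logistic": (lambda f: math.log1p(math.exp(-f)), lambda b: 1.0),
    "exponential": (lambda f: math.exp(-f), lambda b: math.exp(b)),
    "least squares": (lambda f: (f - 1.0) ** 2, lambda b: 2.0 * (b + 1.0)),
    "modified least squares": (lambda f: max(1.0 - f, 0.0) ** 2, lambda b: 2.0 * (b + 1.0)),
    "p-norm (p=3)": (lambda f: abs(f - 1.0) ** 3, lambda b: 3.0 * (b + 1.0) ** 2),
}

def max_chord_slope(phi, beta, n=2000):
    """Largest |phi(b) - phi(a)| / (b - a) over adjacent grid points in [-beta, beta]."""
    grid = [-beta + 2.0 * beta * i / n for i in range(n + 1)]
    return max(abs(phi(b) - phi(a)) / (b - a) for a, b in zip(grid, grid[1:]))

results = {}
for name, (phi, gamma) in LOSSES.items():
    for beta in (0.5, 2.0, 5.0):
        results[(name, beta)] = (max_chord_slope(phi, beta), gamma(beta))
        print(name, beta, results[(name, beta)])
```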

3.3. Main statistical results. We may now state the main statistical re- 
sults based on the assumptions and definitions given earlier. The following 
theorem gives conditions for our boosting algorithm so that consistency can 
be achieved in the large sample limit. The proof is deferred to Section 5.2, 
with some auxiliary results. 

Theorem 3.1. Under Assumption 3.2 let $\phi$ be one of the loss functions
considered in Section A.1. Assume further that in Algorithm 2.1 we choose
quantities $f_0$, $\varepsilon_k$ and $\Lambda_k$ to be independent of the sample $Z_1^m$, such that
$\sum_{j=0}^{\infty} \varepsilon_j < \infty$, and $h_k = \sup \Lambda_k$ satisfies (4).

Consider two sequences of sample independent numbers $k_m$ and $\beta_m$ such
that $\lim_{m\to\infty} k_m = \infty$ and $\lim_{m\to\infty} \gamma_\phi(\beta_m) \beta_m R_m(S) = 0$. Then as long as we
stop Algorithm 2.1 at a step $\hat k$ based on $Z_1^m$ such that $\hat k \ge k_m$ and $\|\hat f_{\hat k}\|_1 \le
\beta_m$, we have the consistency result

$$\lim_{m \to \infty} E_{Z_1^m} Q(\hat f_{\hat k}) = \inf_{f \in \mathrm{span}(S)} Q(f).$$

Remark 3.2. The choice of $(k_m, \beta_m)$ in the above theorem should not
be void, in the sense that for all samples $Z_1^m$ it should be possible to stop
Algorithm 2.1 at a point such that the conditions $\hat k \ge k_m$ and $\|\hat f_{\hat k}\|_1 \le \beta_m$
are satisfied.

In particular, if $\lim_{m\to\infty} R_m(S) = 0$, then we can always find $k_m \le k'_m$
such that $k_m \to \infty$ and $\gamma_\phi(\beta_m) \beta_m R_m(S) \to 0$ with $\beta_m = \|f_0\|_1 + \sum_{j=0}^{k'_m} h_j$.
This choice of $(k_m, \beta_m)$ is valid as we can stop the algorithm at any $\hat k \in
[k_m, k'_m]$.
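
A small worked example may help with the choice of the stopping window. The sketch below assumes logistic loss (Lipschitz constant at most 1), $f_0 = 0$, step-size bounds $h_j = (j+1)^{-3/4}$, and a class with Rademacher complexity equal to $C_S/\sqrt{m}$. All of these are illustrative assumptions, as is the target $m^{1/4}$ for the accumulated step-size. It picks the last index whose accumulated step-size stays below the target and reports the product of the Lipschitz constant, the accumulated step-size and the Rademacher complexity, which the consistency theorem requires to vanish.

```python
def early_stopping_window(m, c_s=1.0, power=0.75):
    """Pick the upper stopping index and beta_m for h_j = (j + 1) ** (-power), f_0 = 0.

    The index returned is the last one whose partial sum of step-sizes stays below m ** 0.25.
    """
    target = m ** 0.25
    beta, k_prime = 0.0, -1
    while beta + (k_prime + 2) ** (-power) <= target:
        k_prime += 1
        beta += (k_prime + 1) ** (-power)     # add h_{k_prime}
    rademacher = c_s / m ** 0.5               # assumed complexity C_S / sqrt(m)
    gamma = 1.0                               # logistic loss: Lipschitz constant <= 1
    return k_prime, beta, gamma * beta * rademacher

rows = [early_stopping_window(m) for m in (10 ** 2, 10 ** 4, 10 ** 6)]
for row in rows:
    print(row)
```

Under these assumptions the upper stopping index grows with $m$ while the product is at most $C_S m^{-1/4}$, so both requirements on the window are met.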

Similar to the consistency result, we may further obtain some rate of
convergence results. This work does not focus on rate of convergence analysis,
and results we obtain are not necessarily tight. Before stating a more general
and more complicated result, we first present a version for constant step-size
logistic boosting, which is much easier to understand.



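Theorem 3.2. Consider the logistic regression loss function, with basis
$S$ which satisfies $R_m(S) \le C_S/\sqrt{m}$ for some positive constant $C_S$. For each
sample size $m$, consider Algorithm 2.1 with $f_0 = 0$, $\sup \Lambda_k = h_0(m) \le 1/\sqrt{m}$
and $\varepsilon_k \le h_0(m)^2/2$. Assume that we run boosting for $k(m) = \beta_m/h_0(m)$
steps. Then

$$E_{Z_1^m} Q(\hat f_{\hat k}) \le \inf_{\bar f \in \mathrm{span}(S)} \Bigl[ Q(\bar f) + \frac{(2C_S + 1)\beta_m}{\sqrt{m}} + \frac{\|\bar f\|_1 + 1}{\sqrt{m}} + \frac{\|\bar f\|_1}{\|\bar f\|_1 + \beta_m} \Bigr].$$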

Note that the condition $R_m(S) \le C_S/\sqrt{m}$ is satisfied for many basis
function classes, such as two-level neural networks and tree basis functions
(see Section 4.3). The bound in Theorem 3.2 is independent of $h_0(m)$
[as long as $h_0(m) \le m^{-1/2}$]. Although this bound is likely to be suboptimal
for practical problems, it does give a worst case guarantee for boosting
with the greedy optimization aspect taken into consideration. Assume
that there exists $\bar f \in \mathrm{span}(S)$ such that $Q(\bar f) = \inf_{f \in \mathrm{span}(S)} Q(f)$. Then we
may choose $\beta_m$ as $\beta_m = O(\|\bar f\|_1^{1/2} m^{1/4})$, which gives a convergence rate
of $E_{Z_1^m} Q(\hat f_{\hat k}) \le Q(\bar f) + O(\|\bar f\|_1^{1/2} m^{-1/4})$. As the target complexity $\|\bar f\|_1$
increases, the convergence becomes slower. An example is provided in Section 6
to illustrate this phenomenon.
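
To see where the $m^{-1/4}$ rate comes from, one can substitute $\beta_m = \|\bar f\|_1^{1/2} m^{1/4}$ term by term. The display below is a sketch that assumes the excess over $Q(\bar f)$ in Theorem 3.2 consists of the three terms $(2C_S+1)\beta_m/\sqrt{m}$, $(\|\bar f\|_1+1)/\sqrt{m}$ and $\|\bar f\|_1/(\|\bar f\|_1+\beta_m)$, and that $\bar f$ is held fixed as $m$ grows.

```latex
\begin{aligned}
\frac{(2C_S+1)\,\beta_m}{\sqrt{m}} &= (2C_S+1)\,\|\bar f\|_1^{1/2}\, m^{-1/4},\\
\frac{\|\bar f\|_1}{\|\bar f\|_1+\beta_m} &\le \frac{\|\bar f\|_1}{\beta_m} = \|\bar f\|_1^{1/2}\, m^{-1/4},\\
\frac{\|\bar f\|_1+1}{\sqrt{m}} &= O(m^{-1/2}) \quad \text{for fixed } \bar f .
\end{aligned}
```

The first two terms balance at order $\|\bar f\|_1^{1/2} m^{-1/4}$, and the third is of lower order for fixed $\bar f$, which gives the stated rate.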

We now state the more general result, on which Theorem 3.2 is based (see 
Section 5.3). 



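Theorem 3.3. Under Assumption 3.2, let $\phi(f) \ge 0$ be a loss function
such that $A(f)$ satisfies Assumption 3.1 with the choice $\psi(a) = a$. Given a
sample size $m$, we pick a positive nonincreasing sequence $\{h_k\}$ which may
depend on $m$. Consider Algorithm 2.1 with $f_0 = 0$, $\sup \Lambda_k = h_k$ and $\varepsilon_k \le
h_k^2 M(s_{k+1})/2$, where $s_k = \sum_{i=0}^{k-1} h_i$.

Given training data, suppose we run boosting for $\hat k = k(m)$ steps, and let
$\beta_m = s_{k(m)}$. Then $\forall \bar f \in \mathrm{span}(S)$ such that $Q(\bar f) \le Q(0)$

$$E_{Z_1^m} Q(\hat f_{\hat k}) \le Q(\bar f) + 2\gamma_\phi(\beta_m)\beta_m R_m(S) + \frac{1}{\sqrt{m}}\phi(-\|\bar f\|_1) + \frac{\|\bar f\|_1 \phi(0)}{\|\bar f\|_1 + \beta_m} + \delta_m(\|\bar f\|_1),$$

where

$$\delta_m(\|\bar f\|_1) = \inf_{1 \le \ell \le k(m)} \Bigl[ \frac{s_\ell + \|\bar f\|_1}{\beta_m + \|\bar f\|_1} h_0^2 \ell + (k(m) - \ell) h_\ell^2 \Bigr] M(\beta_m + h_{k(m)}).$$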

If the target function is / which belongs to span(S'), then Theorem 3.3 
can be directly interpreted as a rate of convergence result. However, the 
expression of $\delta_m$ may still be quite complicated. For specific loss function
and step-size choices, the bound can be simplified. For example, the result 
for logistic boosting in Theorem 3.2 follows easily from the theorem (see 
Section 5.3). 

4. Preparatory results. As discussed earlier, it is well known by now 
that boosting can overfit if left to run until convergence. In Section 3.3 we 
stated our main results that with appropriately chosen stopping rules and 
under regularity conditions, results of consistency and rates of convergence 
can be obtained. In this section we begin the proof process of these main 
results by proving the necessary preparatory results, which are interesting 
in their own right, especially those on numerical convergence of boosting in 
Section 4.1. 

Suppose that we run Algorithm 2.1 on the sample $Z_1^m$ and stop at step $\hat k$.
By the triangle inequality and for any $\bar f \in \mathrm{span}(S)$, we have

$$\begin{aligned} E_{Z_1^m} Q(\hat f_{\hat k}) - Q(\bar f) \le{}& E_{Z_1^m} |Q(\hat f_{\hat k}) - \hat Q(\hat f_{\hat k})| + E_{Z_1^m} |\hat Q(\bar f) - Q(\bar f)| \\ &+ E_{Z_1^m} [\hat Q(\hat f_{\hat k}) - \hat Q(\bar f)]. \end{aligned} \tag{9}$$

The middle term is on a fixed $\bar f$, and thus it has a rate of convergence
$O(1/\sqrt{m})$ by the CLT. To study the consistency and rates of convergence
of boosting with early stopping, the work lies in dealing with the
first and third terms in (9). The third term is on the empirical performance
of the boosting algorithm, and thus a numerical convergence analysis is
required; it is carried out in Section 4.1. Using modern empirical process
theory, in Section 4.2 we upper bound the first term in terms of Rademacher
complexity.

We will focus on the loss functions (such as those in Section A.1) which
satisfy Assumption 3.1. In particular, we assume that $\psi$ is a monotonic
increasing function, so that minimizing $A(f)$ or $\hat A(f)$ is equivalent to minimizing
$Q(f)$ or $\hat Q(f)$. The derivation in Section 4.2 works with $Q(f)$ and
$\hat Q(f)$ directly, instead of $A(f)$ and $\hat A(f)$. The reason is that, unlike our
convergence analysis in Section 4.1, the relatively simple sample complexity
analysis presented in Section 4.2 does not take advantage of $\psi$.

4.1. Numerical convergence analysis. Here we consider the numerical
convergence behavior of $f_k$ obtained from the greedy boosting procedure
as $k$ increases. For notational simplicity, we state the convergence results in
terms of the population boosting algorithm, even though they also hold for
the empirical boosting algorithm. The proofs of the two main lemmas are
deferred to Section A.2.

In our convergence analysis, we will specify convergence bounds in terms
of $\|\bar f\|_1$ (where $\bar f$ is a reference function) and a sequence of nondecreasing
numbers $s_k$ satisfying the following condition: there exist positive numbers
$h_k$ such that

$$|\bar\alpha_k| \le h_k \in \Lambda_k \quad \text{and let} \quad s_k = \|f_0\|_1 + \sum_{i=0}^{k-1} h_i, \tag{10}$$

where $\{\bar\alpha_k\}$ are the step-sizes in (3). Note that $h_k$ in (10) can be taken as
any number that satisfies the above condition, and it can depend on $\{\bar\alpha_k\}$
computed by the boosting algorithm. However, it is often desirable to state
a convergence result that does not depend on the actual boosting outputs
(i.e., the actual $\bar\alpha_k$ computed). For such results we may simply fix $h_k$ by
letting $h_k = \sup \Lambda_k$. This gives convergence bounds for the restricted
step-size method which we mentioned earlier.

It can be shown (see Section A.2) that even in the worst case, the value
$A(f_{k+1}) - A(\bar f)$ decreases from $A(f_k) - A(\bar f)$ by a reasonable quantity.
Cascading this analysis leads to a numerical rate or speed of convergence for
the boosting procedure.

The following lemma contains the one-step convergence bound, which is 
the key result in our convergence analysis. 

Lemma 4.1. Assume that $A(f)$ satisfies Assumption 3.1. Consider $h_k$
and $s_k$ that satisfy (10). Let $\bar f$ be an arbitrary reference function in $\mathrm{span}(S)$,
and define

$$\Delta A(f_k) = \max(0, A(f_k) - A(\bar f)), \tag{11}$$

$$\bar\varepsilon_k = \frac{h_k^2}{2} M(s_{k+1}) + \varepsilon_k. \tag{12}$$

Then after $k$ steps, the following bound holds for $f_{k+1}$ obtained from Algorithm 2.1:

$$\Delta A(f_{k+1}) \le \Bigl(1 - \frac{h_k}{s_k + \|\bar f\|_1}\Bigr) \Delta A(f_k) + \bar\varepsilon_k. \tag{13}$$

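Applying Lemma 4.1 repeatedly, we arrive at a convergence bound for the
boosting Algorithm 2.1 as in the following lemma.

Lemma 4.2. Under the assumptions of Lemma 4.1, we have

$$\Delta A(f_k) \le \frac{\|f_0\|_1 + \|\bar f\|_1}{s_k + \|\bar f\|_1} \Delta A(f_0) + \sum_{j=1}^{k} \frac{s_j + \|\bar f\|_1}{s_k + \|\bar f\|_1} \bar\varepsilon_{j-1}. \tag{14}$$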
The above lemma gives a quantitative bound on the convergence of $A(f_k)$
to the value $A(\bar f)$ of an arbitrary reference function $\bar f \in \mathrm{span}(S)$. We can
see that the numerical convergence speed of $A(f_k)$ to $A(\bar f)$ depends on $\|\bar f\|_1$
and the accumulated or total step-size $s_k$. Specifically, if we choose $\bar f$ such
that $A(\bar f) \le A(f_0)$, then it follows from the above bound that



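$$A(f_{k+1}) \le A(\bar f) + \frac{\|f_0\|_1 + \|\bar f\|_1}{s_{k+1} + \|\bar f\|_1}\bigl(A(f_0) - A(\bar f)\bigr) + \sum_{j=0}^{k} \frac{s_{j+1} + \|\bar f\|_1}{s_{k+1} + \|\bar f\|_1} \bar\varepsilon_j. \tag{15}$$

Note that the inequality is automatically satisfied when $A(f_{k+1}) \le A(\bar f)$.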

Clearly, in order to select $\bar f$ to optimize the bound on the right-hand
side, we need to balance a trade-off: we may select $\bar f$ such that $A(\bar f)$ (and
thus the first term) becomes smaller as we increase $\|\bar f\|_1$; however, the other
two terms will become large when $\|\bar f\|_1$ increases. This bound also reveals
the dependence of the convergence on the initial value of the algorithm $f_0$:
the closer $A(f_0)$ gets to the infimum of $A$, the smaller the bound. To our
knowledge, this is the first convergence bound for greedy boosting procedures
with quantitative numerical convergence speed information.
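
The way the one-step bound cascades into a cumulative bound can be checked numerically. The sketch below assumes the one-step recursion $d_{k+1} = (1 - h_k/(s_k + \|\bar f\|_1))\, d_k + \bar\varepsilon_k$ and the cumulative form $d_k \le \frac{s_0 + \|\bar f\|_1}{s_k + \|\bar f\|_1} d_0 + \sum_{j=1}^{k} \frac{s_j + \|\bar f\|_1}{s_k + \|\bar f\|_1} \bar\varepsilon_{j-1}$ with $s_0 = \|f_0\|_1$. It iterates the recursion with equality and confirms that the iterates stay below the cumulative expression. The step-size schedule, $M = 1$, $\varepsilon_k = 0$ and the other inputs are made-up values.

```python
def cascade(h, eps_bar, f0_norm, fbar_norm, d0):
    """Iterate the one-step recursion with equality and evaluate the closed-form sum.

    h[k] and eps_bar[k] play the roles of h_k and bar-epsilon_k; d0 stands for Delta A(f_0).
    """
    s = [f0_norm]
    for hk in h:
        s.append(s[-1] + hk)                      # s_{k+1} = s_k + h_k
    iterated, closed = [d0], [d0]
    for k in range(len(h)):
        iterated.append((1.0 - h[k] / (s[k] + fbar_norm)) * iterated[-1] + eps_bar[k])
        denom = s[k + 1] + fbar_norm
        tail = sum((s[j] + fbar_norm) * eps_bar[j - 1] for j in range(1, k + 2))
        closed.append(((s[0] + fbar_norm) * d0 + tail) / denom)
    return iterated, closed

# Hypothetical inputs: M = 1 and epsilon_k = 0, so bar-epsilon_k = h_k^2 / 2.
h = [0.5 / (k + 1) ** 0.75 for k in range(200)]
eps_bar = [hk * hk / 2.0 for hk in h]
iterated, closed = cascade(h, eps_bar, f0_norm=0.0, fbar_norm=2.0, d0=1.0)
print(iterated[-1], closed[-1])
```

The comparison works because $1 - h_k/(s_k + \|\bar f\|_1) \le (s_k + \|\bar f\|_1)/(s_{k+1} + \|\bar f\|_1)$, which lets the product of contraction factors telescope.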

Previous analyses, including matching pursuit for least squares [29], Breiman's 
analysis [9] of the exponential loss, as well as the Bregman divergence bound 
in [12] and the analysis of gradient boosting in [31], were all limiting results 
without any information on the numerical speed of convergence. The key 
conceptual difference here is that we do not compare to the optimal value 
directly, but instead, to the value of an arbitrary $\bar f \in \mathrm{span}(S)$, so that $\|\bar f\|_1$
can be used to measure the convergence speed. This approach is also crucial
for problems where $A(\cdot)$ can take $-\infty$ as its infimum, for which a
direct comparison will clearly fail (e.g., Breiman's exponential loss analysis
requires smoothness assumptions to prevent this $-\infty$ infimum value).

A general limiting convergence result follows directly from the above 
lemma. 

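Theorem 4.1. Assume that $\sum_{j=0}^{\infty} \bar\varepsilon_j < \infty$ and $\sum_{j=0}^{\infty} h_j = \infty$; then we
have the following optimization convergence result for the greedy boosting
algorithm (2.1):

$$\lim_{k \to \infty} A(f_k) = \inf_{f \in \mathrm{span}(S)} A(f).$$

Proof. The assumptions imply that $\lim_{k\to\infty} s_k = \infty$. We can thus construct
a nonnegative integer-valued function $k \to j(k) \le k$ such that $\lim_{k\to\infty} s_{j(k)}/s_k = 0$
and $\lim_{k\to\infty} s_{j(k)} = \infty$.

From Lemma 4.2 we obtain for any fixed $\bar f$,

$$\begin{aligned} \Delta A(f_k) &\le \frac{\|f_0\|_1 + \|\bar f\|_1}{s_k + \|\bar f\|_1} \Delta A(f_0) + \sum_{j=1}^{k} \frac{s_j + \|\bar f\|_1}{s_k + \|\bar f\|_1} \bar\varepsilon_{j-1} \\ &\le o(1) + \frac{s_{j(k)} + \|\bar f\|_1}{s_k + \|\bar f\|_1} \sum_{j=1}^{j(k)} \bar\varepsilon_{j-1} + \sum_{j=j(k)+1}^{k} \bar\varepsilon_{j-1} = o(1). \end{aligned}$$

Therefore $\lim_{k\to\infty} \max(0, A(f_k) - A(\bar f)) = 0$. Since our analysis applies to
any $\bar f \in \mathrm{span}(S)$, we can choose $\bar f_j \in \mathrm{span}(S)$ such that $\lim_j A(\bar f_j) = \inf_{f \in \mathrm{span}(S)} A(f)$.
Now from $\lim_{k\to\infty} \max(0, A(f_k) - A(\bar f_j)) = 0$, we obtain the theorem. □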

Corollary 4.1. For loss functions such as those in Section A.1, we
have $\sup_a M(a) < \infty$. Therefore as long as there exist $h_j$ in (10) and $\varepsilon_j$
in (3) such that $\sum_{j=0}^{\infty} h_j = \infty$, $\sum_{j=0}^{\infty} h_j^2 < \infty$ and $\sum_{j=0}^{\infty} \varepsilon_j < \infty$, we have the
following convergence result for the greedy boosting procedure:

$$\lim_{k \to \infty} A(f_k) = \inf_{f \in \mathrm{span}(S)} A(f).$$

The above results regarding population minimization automatically apply
to the empirical minimization if we assume that the starting point $f_0$, as well
as quantities $\varepsilon_k$ and $\Lambda_k$ in (3), are sample-independent, and we consider the
restricted step-size case where $h_k = \sup \Lambda_k$ satisfies the condition (4).

The idea of restricting the step-size when we compute $\bar\alpha_j$ was advocated by
Friedman, who discovered empirically that taking small step-size helps [14]. 
In our analysis, we can restrict the search region so that Corollary 4.1 is 
automatically satisfied. Since we believe this is an important case which 
applies for general loss functions, we shall explicitly state the corresponding 
convergence result below. 

Corollary 4.2. Consider a loss function (e.g., those in Section A.1)
such that $\sup_a M(a) < +\infty$. Pick any sequence of positive numbers $h_j$ ($j \ge 0$)
such that $\sum_{j=0}^{\infty} h_j = \infty$, $\sum_{j=0}^{\infty} h_j^2 < \infty$. If we choose $\Lambda_k$ in
Algorithm 2.1 such that $h_k = \sup \Lambda_k$, and $\varepsilon_j$ in (3) such that
$\sum_{j=0}^{\infty} \varepsilon_j < \infty$, then

\[
\lim_{k\to\infty} A(f_k) = \inf_{f\in\mathrm{span}(S)} A(f).
\]

Note that the above result requires that the step-size $h_j$ be small
($\sum_{j=0}^{\infty} h_j^2 < \infty$), but also not too small ($\sum_{j=0}^{\infty} h_j = \infty$). As
discussed above, the first condition prevents large oscillation. The second
condition is needed to ensure that $f_k$ can cover the whole space $\mathrm{span}(S)$.
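
As an informal numerical illustration of these two step-size conditions (it is not part of the analysis), the Python sketch below accumulates partial sums of $h_j$ and $h_j^2$ for a power schedule $h_j = (j+1)^{-p}$. The schedule, the exponent $p = 2/3$ and the function names are assumptions made for the example; any $1/2 < p \le 1$ should behave similarly.

```python
def partial_sums(h, n):
    """Return (sum of h_j, sum of h_j^2) over j = 0..n-1 for a schedule h(j)."""
    s1 = s2 = 0.0
    for j in range(n):
        hj = h(j)
        s1 += hj
        s2 += hj * hj
    return s1, s2


def power_schedule(j, p=2.0 / 3.0):
    # Hypothetical schedule h_j = (j + 1)^(-p). For 1/2 < p <= 1 the sum of
    # h_j diverges while the sum of h_j^2 stays bounded.
    return (j + 1.0) ** (-p)


for n in (10**2, 10**3, 10**4, 10**5):
    s1, s2 = partial_sums(power_schedule, n)
    print(f"n={n:>7}  sum h_j={s1:9.2f}  sum h_j^2={s2:.4f}")
```

The first column of sums keeps growing with n, while the second levels off, which is the qualitative behavior the corollary asks for.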

The above convergence results are limiting results that do not carry any 
convergence speed information. Although with specific choices of $h_k$ and
$\varepsilon_k$ one may obtain such information from (14), the second term on the
right-hand side is typically quite complicated. It is thus useful to state a
simple result for a specific choice of $h_k$ and $\varepsilon_k$, which yields more explicit
convergence information.



Corollary 4.3. Assume that $A(f)$ satisfies Assumption 3.1. Pick a
sequence of nonincreasing positive numbers $h_j$ ($j \ge 0$). Suppose we choose
$\Lambda_k$ in Algorithm 2.1 such that $h_k = \sup \Lambda_k$, and choose $\varepsilon_k$ in (3) such that
$\varepsilon_k \le h_k^2 M(s_{k+1})/2$. If we start Algorithm 2.1 with $f_0 = 0$, then

\[
\Delta A(f_k) \le \frac{\|\bar f\|_1}{s_k + \|\bar f\|_1}\,\Delta A(f_0)
  + \inf_{1 \le \ell \le k} \left[ \frac{\ell\,(s_\ell + \|\bar f\|_1)}{s_k + \|\bar f\|_1}\,h_0^2
  + (k - \ell)\,h_\ell^2 \right] M(s_{k+1}).
\]

Proof. Using notation of Lemma 4.1, we have $\bar\varepsilon_k \le h_k^2 M(s_{k+1})$. Therefore
each summand in the second term on the right-hand side of Lemma 4.2
is no more than $h_\ell^2 M(s_{k+1})$ when $j > \ell$ and is no more than
$h_0^2 M(s_{k+1})(s_\ell + \|\bar f\|_1)/(s_k + \|\bar f\|_1)$ when $j \le \ell$. The desired inequality
is now a straightforward consequence of (14). □



Note that similar to the proof of Theorem 4.1, the term $(k - \ell) h_\ell^2$ in
Corollary 4.3 can also be replaced by $\sum_{j=\ell+1}^{k} h_{j-1}^2$. A special case of
Corollary 4.3 is constant step-size ($h_k = h_0$) boosting, which is the original
version of restricted step-size boosting considered by Friedman [14]. This
method is simple to apply since there is only one step-size parameter to choose.
Corollary 4.3 shows that boosting with constant step-size (also referred to
as $\varepsilon$-boosting in the literature) converges to the optimal value in the limit
of $h_0 \to 0$, as long as we choose the number of iterations $k$ and step-size $h_0$
such that $k h_0 \to \infty$ and $k h_0^2 \to 0$. To the best of our knowledge, this is the
only rigorously stated convergence result for the $\varepsilon$-boosting method, which
justifies why one needs to use a step-size that is as small as possible.
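
The constant step-size case can be explored numerically. The Python sketch below evaluates the right-hand side of Corollary 4.3, read here as $\frac{\|\bar f\|_1}{s_k + \|\bar f\|_1}\Delta A(f_0) + \inf_{1\le\ell\le k}\big[\frac{\ell(s_\ell + \|\bar f\|_1)}{s_k + \|\bar f\|_1}h_0^2 + (k-\ell)h_0^2\big]M$, with $s_j = j h_0$. The function name, the default values $\Delta A(f_0) = 1$ and $M = 1$, and the coupling $k = \lceil h_0^{-3/2}\rceil$ are assumptions chosen for illustration only.

```python
import math


def corollary_4_3_bound(k, h0, f_norm, delta_a0=1.0, m_const=1.0):
    """Right-hand side of the bound for constant step-size h0 and f_0 = 0.

    f_norm stands for the 1-norm of the comparison function, delta_a0 for
    Delta A(f_0) and m_const for an upper bound on M(s_{k+1}); all three are
    inputs assumed by this sketch.
    """
    def s(j):
        return j * h0  # s_j = j * h0 when f_0 = 0 and every h_j equals h0

    denom = s(k) + f_norm
    first = f_norm / denom * delta_a0
    second = min(
        ell * (s(ell) + f_norm) / denom * h0**2 + (k - ell) * h0**2
        for ell in range(1, k + 1)
    )
    return first + second * m_const


for h0 in (0.1, 0.01, 0.001):
    k = math.ceil(h0 ** -1.5)  # then k*h0 grows while k*h0^2 shrinks
    print(f"h0={h0:<6} k={k:<6} bound={corollary_4_3_bound(k, h0, 1.0):.4f}")
```

With this coupling, $k h_0 = h_0^{-1/2}$ grows and $k h_0^2 = h_0^{1/2}$ shrinks, so the printed bound should decrease as $h_0$ gets smaller.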

It is also possible to handle sample-dependent choices of $\Lambda_k$ in
Algorithm 2.1, or allow unrestricted step-size ($\Lambda_k = \mathbb{R}$) for certain
formulations. However, the corresponding analysis becomes much more complicated.
According to Friedman [14], the restricted step-size boosting procedure is
preferable in practice. Therefore we shall not provide a consistency analysis
for unrestricted step-size formulations in this paper; but see Section A.3 for
relaxations of the restricted step-size condition.

In addition to the above convergence results for general boosting
algorithms, Lemma 4.2 has another very useful consequence regarding the
limiting behavior of AdaBoost in the separable classification case. It asserts that
the infinitely small step-size version of AdaBoost, in the convergence limit,
is an $L_1$ margin maximizer. This result has been observed through a
connection between boosting with early stopping and $L_1$ constrained boosting (see
[18]). Our analysis gives a direct and rigorous proof. This result is interesting
because it shows that AdaBoost shares some similarity (in the limit) with
support vector machines (SVMs) whose goal in the separable case is to find
maximum margin classifiers; the concept of margin has been popularized by
Vapnik [36] who used it to analyze the generalization performance of SVMs.
The detailed analysis is provided in Section A.4.

4.2. Uniform convergence. There are a number of possible ways to study 
the uniform convergence of empirical processes. In this section we use a 
relatively simple approach based on Rademacher complexity. Examples with 
neural networks and tree-basis (left orthants) functions will be given to 
illustrate our analysis. 

The Rademacher complexity approach for analyzing boosting algorithms 
appeared first in [21]. Due to its simplicity and elegance, it has been used 
and generalized by many researchers [2, 3, 4, 6, 30]. The approach used here 
essentially follows Theorem 1 of [21], but without concentration results. 

From Lemma 4.2 we can see that the convergence of the boosting
procedure is closely related to $\|\bar f\|_1$ and $\|f_k\|_1$. Therefore it is natural for us to
measure the learning complexity of Algorithm 2.1 based on the 1-norm of
the function family it can approximate at any given step. We shall mention
that this analysis is not necessarily the best approach for obtaining tight
learning bounds since the boosting procedure may effectively search a much
smaller space than the function family measured by the 1-norm $\|f_k\|_1$.
However, it is relatively simple, and sufficient for our purpose of providing an
early-stopping strategy to give consistency and some rate of convergence
results.

Given any $\beta > 0$, we now would like to estimate the rate of uniform
convergence,

\[
R_m^\beta = E_{Z_1^m} \sup_{\|f\|_1 \le \beta} \big(Q(f) - \hat Q(f)\big),
\]

where $Q$ and $\hat Q$ are defined in (8).

The concept of Rademacher complexity used in our analysis is given in
Definition 3.2. For simplicity, our analysis also employs Assumption 3.2. As
mentioned earlier, the conditions are not essential, but rather they simplify
the final results. For example, the condition (7) implies that
$\forall f \in \mathrm{span}(S)$, $|f(x)| \le \|f\|_1$. It follows that $\forall \beta \ge \|f\|_1$,
$\phi(f, y) \le \phi(-\beta)$. This inequality, although convenient, is certainly not
essential.

Lemma 4.3. Under Assumption 3.2,

\[
R_m^\beta = E_{Z_1^m} \sup_{\|f\|_1 \le \beta} \big[E_D \phi(f(X), Y) - \hat E \phi(f(X), Y)\big]
  \le 2 \gamma_\phi(\beta)\, \beta\, R_m(S), \tag{16}
\]

where $\gamma_\phi(\beta)$ is a Lipschitz constant of $\phi$ in $[-\beta, \beta]$:
$\forall |f_1|, |f_2| \le \beta$: $|\phi(f_1) - \phi(f_2)| \le \gamma_\phi(\beta) |f_1 - f_2|$.

Proof. Using the standard symmetrization argument (e.g., see Lemma 2.3.1
of [35]), we have

\[
R_m^\beta = E_{Z_1^m} \sup_{\|f\|_1 \le \beta} \big[E_D \phi(f(X), Y) - \hat E \phi(f(X), Y)\big]
  \le 2 R_m(\{\phi(f(X), Y) : \|f\|_1 \le \beta\}).
\]

Now the one-sided Rademacher process comparison result in [32],
Theorem 7, which is essentially a slightly refined result (with better constant) of
the two-sided version in [24], Theorem 4.12, implies that

\[
R_m(\{\phi(f(X), Y) : \|f\|_1 \le \beta\}) \le \gamma_\phi(\beta)\, R_m(\{f(X) : \|f\|_1 \le \beta\}).
\]

Using the simple fact that $g = \sum_i \alpha_i f_i$ ($\sum_i |\alpha_i| = 1$) implies
$g \le \max(\sup_i f_i, \sup_i -f_i)$, and that $S$ is closed under negation, it is easy
to verify that $R_m(S) = R_m(\{f \in \mathrm{span}(S) : \|f\|_1 \le 1\})$. Therefore

\[
R_m(\{f(X) : \|f\|_1 \le \beta\}) = \beta R_m(S).
\]

Now by combining the three inequalities, we obtain the lemma. □

4.3. Estimating Rademacher complexity. Our uniform convergence
result depends on the Rademacher complexity $R_m(S)$. For many function
classes, it can be estimated directly. In this section we use a relation
between Rademacher complexity and $\ell_2$-covering numbers from [35].

Let $X = \{X_1, \ldots, X_m\}$ be a set of points and let $Q_m$ be the uniform
probability measure over these points. We define the $\ell_2(Q_m)$ distance between
any two functions $f$ and $g$ as

\[
\ell_2(Q_m)(f, g) = \left( \frac{1}{m} \sum_{i=1}^{m} |f(X_i) - g(X_i)|^2 \right)^{1/2}.
\]

Let $F$ be a class of functions. The empirical $\ell_2$-covering number of $F$,
denoted by $N(\varepsilon, F, \ell_2(Q_m))$, is the minimal number of balls
$\{g : \ell_2(Q_m)(g, f) \le \varepsilon\}$ of radius $\varepsilon$ needed to cover $F$. The uniform
$\ell_2$ covering number is given by

\[
N_2(\varepsilon, F, m) = \sup_{Q_m} N(\varepsilon, F, \ell_2(Q_m)),
\]

where the supremum is over all probability distributions $Q_m$ over samples
of size $m$. If $F$ contains 0, then there exists a universal constant $C$ (see
Corollary 2.2.8 in [35]) such that

\[
R_m(F) \le \left( \int_0^\infty \sqrt{\log N_2(\varepsilon, F, m)}\, d\varepsilon \right) \frac{C}{\sqrt{m}},
\]

where we assume that the integral on the right-hand side is finite. Note that
for a function class $F$ with divergent integration value on the right-hand side,
the above inequality can be easily modified so that we start the integration
from a point $\varepsilon_0 > 0$ instead of 0. However, the dependency of $R_m(F)$ on $m$
can be slower than $1/\sqrt{m}$.



Assumption 4.1. $F$ satisfies the condition

\[
\sup_m \int_0^\infty \sqrt{\log N_2(\varepsilon, F, m)}\, d\varepsilon < \infty.
\]



A function class F that satisfies Assumption 4.1 is also a Donsker class, for 
which the central limit theorem holds. In statistics and machine learning, 
one often encounters function classes F with finite VC-dimension, where 
the following condition holds (see Theorem 2.6.7 of [35]) for some constants
$C$ and $V$ independent of $m$: $N_2(\varepsilon, F, m) \le C (1/\varepsilon)^V$. Clearly a function class
with finite VC-dimension satisfies Assumption 4.1.

For simplicity, in this paper we assume that $S$ satisfies Assumption 4.1.
It follows that

\[
R_m(S) \le R_m(S \cup \{0\}) \le \frac{C_S}{\sqrt{m}}, \tag{17}
\]

where $C_S$ is a constant that depends on $S$ only. This is the condition used
in Theorem 3.2. We give two examples of basis functions that are often used
in practice with boosting.



Two-level neural networks. We consider two-level neural networks in $\mathbb{R}^d$,
which form the function space $\mathrm{span}(S)$ with $S$ given by

\[
S = \{\sigma(w^T x + b) : w \in \mathbb{R}^d,\ b \in \mathbb{R}\},
\]

where $\sigma(\cdot)$ is a monotone bounded continuous activation function.

It is well known that $S$ has a finite VC-dimension, and thus satisfies
Assumption 4.1. In addition, for any compact subset $U \subset \mathbb{R}^d$, it is also well
known that $\mathrm{span}(S)$ is dense in $C(U)$ (see [26]).

Tree-basis functions. Tree-basis (left orthant) functions in $\mathbb{R}^d$ are given
by the indicator function of rectangular regions,

\[
S = \{I((-\infty, a_1] \times \cdots \times (-\infty, a_d]) : a_1, \ldots, a_d \in \mathbb{R}\}.
\]

Similar to two-level neural networks, it is well known that $S$ has a finite
VC-dimension, and for any compact set $U \subset \mathbb{R}^d$, $\mathrm{span}(S)$ is dense in $C(U)$.

In addition to rectangular region basis functions, we may also consider a
basis $S$ consisting of restricted size classification and regression trees
(disjoint unions of constant functions on rectangular regions), where we assume
that the number of terminal nodes is no more than a constant $V$. Such a
basis set $S$ also has a finite VC-dimension.
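
For a small class such as one-dimensional threshold indicators, the Rademacher complexity can also be estimated by direct simulation. The Python sketch below assumes the usual one-sided empirical form $R_m(G) = E_\sigma \sup_{g \in G} \frac{1}{m}\sum_i \sigma_i g(X_i)$ (Definition 3.2 is not restated here, so this form is an assumption), takes $S = \{\pm I((-\infty, a])\}$ on $m$ distinct points, and uses the fact that the supremum over thresholds is then the largest absolute prefix sum of the signs taken in sorted order. The function and parameter names are illustrative.

```python
import random


def rademacher_stumps(m, n_draws=500, seed=0):
    """Monte Carlo estimate of the Rademacher complexity of 1-D stumps.

    Assumes m distinct sample points and the class {+/- I((-inf, a])}. The
    point values do not matter: only their order does, so the supremum over
    thresholds equals the largest |prefix sum| of the random signs.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        prefix, best = 0, 0  # best = 0 covers a threshold below every point
        for _ in range(m):
            prefix += rng.choice((-1, 1))
            best = max(best, abs(prefix))
        total += best / m
    return total / n_draws


for m in (100, 400, 1600):
    r = rademacher_stumps(m)
    print(f"m={m:<5} R_m ~ {r:.4f}   sqrt(m) * R_m ~ {r * m ** 0.5:.3f}")
```

Under these assumptions the product $\sqrt{m}\,R_m(S)$ should stay roughly constant as $m$ grows, which is the $C_S/\sqrt{m}$ behavior stated in (17).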



5. Consistency and rates of convergence with early stopping. In this 
section we put together the results in the preparatory Section 4 to prove 
consistency and some rate of convergence results for Algorithm 2.1 as stated 
in the main result Section 3.3. For simplicity we consider only restricted 
step-size boosting with relatively simple strategies for choosing step-sizes. 
According to Friedman [14], the restricted step-size boosting procedure is 
preferable in practice. Therefore we shall not provide a consistency
analysis for unrestricted step-size formulations in this paper. Discussions on the
relaxation of the step-size condition can be found in Section A.3.



5.1. General decomposition. Suppose that we run the boosting algorithm
and stop at an early stopping point $\hat k$. The quantity $\hat k$, which is to be specified
in Section 5.2, may depend on the empirical sample $Z_1^m$. Suppose also that
the stopping point $\hat k$ is chosen so that the resulting boosting estimator
$\hat f_{\hat k}$ satisfies

\[
\lim_{m\to\infty} E_{Z_1^m} Q(\hat f_{\hat k}) = \inf_{f\in\mathrm{span}(S)} Q(f), \tag{18}
\]

where we use $E_{Z_1^m}$ to denote the expectation with respect to the random
sample $Z_1^m$. Since $Q(\hat f_{\hat k}) \ge \inf_{f\in\mathrm{span}(S)} Q(f)$, we also have

\[
\lim_{m\to\infty} E_{Z_1^m} \Big| Q(\hat f_{\hat k}) - \inf_{f\in\mathrm{span}(S)} Q(f) \Big|
  = \lim_{m\to\infty} E_{Z_1^m} Q(\hat f_{\hat k}) - \inf_{f\in\mathrm{span}(S)} Q(f) = 0.
\]

If we further assume there is a unique $f^*$ such that

\[
Q(f^*) = \inf_{f\in\mathrm{span}(S)} Q(f),
\]

and for any sequence $\{f_m\}$, $Q(f_m) \to Q(f^*)$ implies that $f_m \to f^*$, then since
$Q(\hat f_{\hat k}) \to Q(f^*)$ as $m \to \infty$, it follows that

\[
\hat f_{\hat k} \to f^* \quad \text{in probability},
\]

which gives the usual consistency of the boosting estimator with an
appropriate early stopping if the target function $f$ coincides with $f^*$. This is the
case, for example, if the regression function $f(x) = E_D(Y|x)$ with respect
to the true distribution $D$ is in $\mathrm{span}(S)$ or can be approximated arbitrarily
close by functions in $\mathrm{span}(S)$.

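In the following, we derive a general decomposition needed for proving
(18) or Theorem 3.1 in Section 3.3. Suppose that Assumption 3.2 holds.
Then for all fixed $\bar f \in \mathrm{span}(S)$, we have

\[
\begin{aligned}
E_{Z_1^m} |\hat Q(\bar f) - Q(\bar f)|
&\le \left[ E_{Z_1^m} |\hat Q(\bar f) - Q(\bar f)|^2 \right]^{1/2} \\
&= \left[ \frac{1}{m} E_D |\phi(\bar f(X) Y) - Q(\bar f)|^2 \right]^{1/2} \\
&\le \left[ \frac{1}{m} E_D \phi(\bar f(X) Y)^2 \right]^{1/2}
 \le \frac{1}{\sqrt{m}}\, \phi(-\|\bar f\|_1).
\end{aligned}
\]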

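Assume that we run Algorithm 2.1 on the sample $Z_1^m$ and stop at step
$\hat k$. If the stopping point $\hat k$ satisfies $P(\|\hat f_{\hat k}\|_1 \le \beta_m) = 1$ for some
sample-independent $\beta_m \ge 0$, then using the uniform convergence estimate in (16),
we obtain

\[
\begin{aligned}
E_{Z_1^m} Q(\hat f_{\hat k}) - Q(\bar f)
&= E_{Z_1^m} [Q(\hat f_{\hat k}) - \hat Q(\hat f_{\hat k})] + E_{Z_1^m} [\hat Q(\bar f) - Q(\bar f)]
 + E_{Z_1^m} [\hat Q(\hat f_{\hat k}) - \hat Q(\bar f)] \\
&\le 2 \gamma_\phi(\beta_m) \beta_m R_m(S) + \frac{1}{\sqrt{m}}\, \phi(-\|\bar f\|_1)
 + \sup_{Z_1^m} [\hat Q(\hat f_{\hat k}) - \hat Q(\bar f)].
\end{aligned} \tag{19}
\]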



5.2. Consistency with restricted step-size boosting. We consider a
relatively simple early-stopping strategy for restricted step-size boosting, where
we take $h_k = \sup \Lambda_k$ to satisfy (4).

Clearly, in order to prove consistency, we only need to stop at a point
such that $\forall \bar f \in \mathrm{span}(S)$, all three terms in (19) become nonpositive in the
limit $m \to \infty$. By estimating the third term using Lemma 4.2, we obtain the
following proof of our main consistency result (Theorem 3.1).



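Proof of Theorem 3.1. Obviously the assumptions of the theorem
imply that the first two terms of (19) automatically converge to zero. In the
following, we only need to show that $\forall \bar f \in \mathrm{span}(S)$:
$\sup_{Z_1^m} \max(0, \hat Q(\hat f_{\hat k}) - \hat Q(\bar f)) \to 0$ when $m \to \infty$.

From Section A.1 we know that there exists a distribution-independent
number $M > 0$ such that $M(a) < M$ for all underlying distributions.
Therefore for all empirical samples $Z_1^m$, Lemma 4.2 implies that

\[
\Delta \hat A(\hat f_{\hat k}) \le \frac{\|f_0\|_1 + \|\bar f\|_1}{s_{\hat k} + \|\bar f\|_1}\, \Delta \hat A(f_0)
  + \sum_{j=1}^{\hat k} \frac{s_j + \|\bar f\|_1}{s_{\hat k} + \|\bar f\|_1}\, \bar\varepsilon_{j-1},
\]

where $\Delta \hat A(f) = \max(0, \hat A(f) - \hat A(\bar f))$, $s_k = \|f_0\|_1 + \sum_{i=0}^{k-1} h_i$ and
$\bar\varepsilon_k = \frac{h_k^2}{2} M + \varepsilon_k$. Now using the inequality
$\Delta \hat A(f_0) \le \max(\psi(\phi(-\|f_0\|_1)) - \psi(\phi(\|\bar f\|_1)), 0) = c(\bar f)$
and $\hat k \ge k_m$, we obtain

\[
\sup_{Z_1^m} \Delta \hat A(\hat f_{\hat k}) \le \sup_{k \ge k_m} \left[
  \frac{\|f_0\|_1 + \|\bar f\|_1}{s_k + \|\bar f\|_1}\, c(\bar f)
  + \sum_{j=1}^{k} \frac{s_j + \|\bar f\|_1}{s_k + \|\bar f\|_1}\, \bar\varepsilon_{j-1} \right]. \tag{20}
\]

Observe that the right-hand side is independent of the sample $Z_1^m$. From
the assumptions of the theorem, we have $\sum_{j=0}^{\infty} \bar\varepsilon_j < \infty$ and
$\lim_{k\to\infty} s_k = \infty$. Now the proof of Theorem 4.1 implies that as $k_m \to \infty$,
the right-hand side of (20) converges to zero. Therefore
$\lim_{m\to\infty} \sup_{Z_1^m} \Delta \hat A(\hat f_{\hat k}) = 0$. □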



The following universal consistency result is a straightforward conse- 
quence of Theorem 3.1. 



Corollary 5.1. Under the assumptions of Theorem 3.1, for any Borel
set $U \subset \mathbb{R}^d$, if $\mathrm{span}(S)$ is dense in $C(U)$ (the set of continuous functions
under the uniform-norm topology), then for all Borel measures $D$ on
$U \times \{-1, 1\}$,

\[
\lim_{m\to\infty} E_{Z_1^m} Q(\hat f_{\hat k}) = \inf_{f \in B(U)} Q(f),
\]

where $B(U)$ is the set of Borel measurable functions.

Proof. We only need to show $\inf_{f\in\mathrm{span}(S)} Q(f) = \inf_{f\in B(U)} Q(f)$. This
follows directly from Theorem 4.1 of [38]. □

For binary classification problems where $y = \pm 1$, given any real-valued
function $f$, we predict $y = 1$ if $f(x) \ge 0$ and $y = -1$ if $f(x) < 0$. The
classification error is the following 0-1 loss function:

\[
\ell(f(x), y) = I[y f(x) \le 0],
\]

where $I[E]$ is the indicator function of the event $E$, and the expected loss is

\[
L(f) = E_D\, \ell(f(X), Y). \tag{21}
\]

The goal of classification is to find a predictor $f$ to minimize (21). Using
the notation $\eta(x) = P(Y = 1|X = x)$, it is well known that $L^*$, the minimum
of $L(f)$, can be achieved by setting $f(x) = 2\eta(x) - 1$. Let $D$ be a Borel
measure defined on $U \times \{-1, 1\}$; it is known (e.g., see [38]) that if
$Q(f) \to \inf_{f \in B(U)} Q(f)$, then $L(f) \to L^*$. We thus have the following consistency
result for binary-classification problems.

Corollary 5.2. Under the assumptions of Corollary 5.1, we have

\[
\lim_{m\to\infty} E_{Z_1^m} L(\hat f_{\hat k}) = L^*.
\]

The stopping criterion given in Theorem 3.1 depends on $R_m(S)$. For $S$
that satisfies Assumption 4.1, this can be estimated from (17). The condition
$\gamma_\phi(\beta_m) \beta_m R_m(S) \to 0$ in Theorem 3.1 becomes $\gamma_\phi(\beta_m) \beta_m = o(\sqrt{m})$. Using
the bounds for $\gamma_\phi(\cdot)$ in Section 4.2, we obtain the following condition.

Assumption 5.1. The sequence $\beta_m$ satisfies:

(i) Logistic regression $\phi(f) = \ln(1 + \exp(-f))$: $\beta_m = o(m^{1/2})$.

(ii) Exponential $\phi(f) = \exp(-f)$: $\beta_m = o(\log m)$.

(iii) Least squares $\phi(f) = (f - 1)^2$: $\beta_m = o(m^{1/4})$.

(iv) Modified least squares $\phi(f) = \max(0, 1 - f)^2$: $\beta_m = o(m^{1/4})$.

(v) p-norm $\phi(f) = |f - 1|^p$ ($p \ge 2$): $\beta_m = o(m^{1/2p})$.
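
The growth rates in Assumption 5.1 can be sanity-checked by evaluating $\gamma_\phi(\beta_m)\beta_m/\sqrt{m}$ for a few sample sizes. In the Python sketch below the Lipschitz constants are obtained by bounding $|\phi'|$ on $[-\beta, \beta]$; they are bookkeeping for this example rather than values quoted from the text, and the function names and the trial sequences $\beta_m$ are likewise assumptions.

```python
import math

# Lipschitz constants gamma_phi(beta) of each loss on [-beta, beta], obtained
# here by bounding |phi'| on that interval (an assumption of this sketch).
LIPSCHITZ = {
    "logistic": lambda b: 1.0,
    "exponential": lambda b: math.exp(b),
    "least_squares": lambda b: 2.0 * (b + 1.0),
    "modified_least_squares": lambda b: 2.0 * (b + 1.0),
}


def p_norm_lipschitz(p):
    """Lipschitz constant of |f - 1|^p on [-beta, beta], for p >= 2."""
    return lambda b: p * (b + 1.0) ** (p - 1.0)


def condition_ratio(gamma, beta, m):
    """gamma_phi(beta) * beta / sqrt(m); it should tend to 0 as m grows."""
    return gamma(beta) * beta / math.sqrt(m)


for m in (10**4, 10**8, 10**12):
    ls_ok = condition_ratio(LIPSCHITZ["least_squares"], m ** 0.2, m)
    ls_bad = condition_ratio(LIPSCHITZ["least_squares"], m ** 0.3, m)
    ex_ok = condition_ratio(LIPSCHITZ["exponential"], math.sqrt(math.log(m)), m)
    print(f"m={m:.0e}  LS beta=m^0.2: {ls_ok:.3f}  "
          f"LS beta=m^0.3: {ls_bad:.3f}  exp beta=sqrt(log m): {ex_ok:.4f}")
```

For least squares the ratio shrinks when $\beta_m = m^{0.2}$ and grows when $\beta_m = m^{0.3}$, consistent with the $o(m^{1/4})$ threshold in (iii).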

We can summarize the above discussion in the following theorem, which 
applies to boosted VC-classes such as boosted trees and two-level neural 
networks. 



Theorem 5.1. Under Assumption 3.2, let $\phi$ be one of the loss functions
considered in Section A.1. Assume further that in Algorithm 2.1 we choose
the quantities $f_0$, $\varepsilon_k$ and $\Lambda_k$ to be independent of the sample $Z_1^m$, such that
$\sum_{j=0}^{\infty} \varepsilon_j < \infty$, and $h_k = \sup \Lambda_k$ satisfies (4).

Suppose $S$ satisfies Assumption 4.1 and we choose sample-independent
$k_m \to \infty$, such that $\beta_m = \|f_0\|_1 + \sum_{j=0}^{k_m} h_j$ satisfies Assumption 5.1. If we
stop Algorithm 2.1 at step $k_m$, then $\|\hat f_{k_m}\|_1 \le \beta_m$ and the following
consistency result holds:

\[
\lim_{m\to\infty} E_{Z_1^m} Q(\hat f_{k_m}) = \inf_{f\in\mathrm{span}(S)} Q(f).
\]

Moreover, if $\mathrm{span}(S)$ is dense in $C(U)$ for a Borel set $U \subset \mathbb{R}^d$, then for all
Borel measures $D$ on $U \times \{-1, 1\}$, we have

\[
\lim_{m\to\infty} E_{Z_1^m} Q(\hat f_{k_m}) = \inf_{f \in B(U)} Q(f), \qquad
\lim_{m\to\infty} E_{Z_1^m} L(\hat f_{k_m}) = L^*.
\]

Note that in the above theorem the stopping criterion $k_m$ is
sample-independent. However, similar to Theorem 3.1, we may allow other
sample-dependent $\hat k$ such that $\|\hat f_{\hat k}\|_1$ stays within the $\beta_m$ bound. One may be
tempted to interpret the rates of $\beta_m$. However, since different loss
functions approximate the underlying distribution in different ways, it is not
clear that one can rigorously compare them. Moreover, our analysis is likely
to be loose.

5.3. Some bounds on the rate of convergence. In addition to consistency,
it is also useful to study statistical rates of convergence of the greedy
boosting method with certain target function classes. Since our analysis is based
on the 1-norm of the target function, the natural function classes we may
consider are those that can be approximated well using a function in $\mathrm{span}(S)$
with small 1-norm.

We would like to emphasize that the rate results, which have been stated in
Theorems 3.2 and 3.3 and are to be proved here, are not necessarily optimal.
There are several reasons for this. First, we relate the numerical behavior 
of boosting to 1-norm regularization. In reality, this may not always be the 
best way to analyze boosting since boosting can be studied using other com- 
plexity measures such as sparsity (e.g., see [22] for some other complexity 
measures). Second, even with the 1-norm regularization complexity mea- 
sure, the numerical convergence analysis in Section 4.1 may not be tight. 
This again will adversely affect our final bounds. Third, our uniform conver- 
gence analysis, based on the relatively simple Rademacher complexity, is not 
necessarily tight. For some problems there are more sophisticated methods 
which improve upon our approach here (e.g., see [2, 3, 4, 5, 6, 22, 30]).

A related point is that bounds we are interested in here are a priori 
convergence bounds that are data-independent. In recent years, there has 
been much interest in developing data-dependent bounds which are tighter 
(see references mentioned above). For example, in our case we may allow 
$\beta$ in (16) to depend on the observed data (rather than simply setting it
to be a value based only on the sample size). This approach, which can 
tighten the final bounds based on observation, is a quite significant recent 
theoretical advance. However, as mentioned above, there are other aspects 
of our analysis that can be loose. Moreover, we are mainly interested in 
worst case scenario upper bounds on the convergence behavior of boosting 
without looking at the data. Therefore we shall not develop data-dependent 
bounds here. 

The statistical convergence behavior of the boosting algorithm relies on its 
numerical convergence behavior, which can be estimated using (14). Com- 
bined with statistical convergence analysis, we can easily obtain our main 
rate of convergence result in Theorem 3.3.

Proof of Theorem 3.3. From (19) we obtain

\[
E_{Z_1^m} Q(\hat f_{\hat k}) \le Q(\bar f) + 2 \gamma_\phi(\beta_m) \beta_m R_m(S)
  + \frac{1}{\sqrt{m}}\, \phi(-\|\bar f\|_1) + \sup_{Z_1^m} [\hat Q(\hat f_{\hat k}) - \hat Q(\bar f)].
\]

Now we simply apply Corollary 4.3 to bound the last term. This leads to
the desired bound. □

The result for logistic regression in Theorem 3.2 follows easily from The- 
orem 3.3. 

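Proof of Theorem 3.2. Consider logistic regression loss and constant
step-size boosting, where $h_k = h_0(m)$. Note that for logistic regression we
have $\gamma_\phi(\beta) \le 1$, $M(a) \le 1$, $\phi(-\|\bar f\|_1) \le 1 + \|\bar f\|_1$ and $\phi(0) \le 1$. Using these
estimates, we obtain from Theorem 3.3,

\[
E_{Z_1^m} Q(\hat f_{\hat k}) \le Q(\bar f) + 2 \beta_m R_m(S) + \frac{\|\bar f\|_1 + 1}{\sqrt{m}}
  + \frac{\|\bar f\|_1}{\|\bar f\|_1 + \beta_m} + \beta_m h_0(m).
\]

Using the estimate of $R_m(S)$ in (17), and letting $h_0(m) \le 1/\sqrt{m}$, we obtain

\[
E_{Z_1^m} Q(\hat f_{\hat k}) \le Q(\bar f) + \frac{(2 C_S + 1) \beta_m}{\sqrt{m}}
  + \frac{\|\bar f\|_1 + 1}{\sqrt{m}} + \frac{\|\bar f\|_1}{\|\bar f\|_1 + \beta_m}.
\]

This leads to the claim. □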



6. Experiments. The purpose of this section is not to reproduce the 
large number of already existing empirical studies on boosting. Although this 
paper is theoretical in nature, it is still useful to empirically examine various 
implications of our analysis, so that we can verify they have observable 
consequences. For this reason our experiments focus mainly on aspects of 
boosting with early stopping which have not been addressed in previous 
studies. 

Specifically, we are interested in testing consistency and various issues of 
boosting with early stopping based on our theoretical analysis. As pointed 
out in [28], experimentally testing consistency is a very challenging task. 
Therefore, in this section we have to rely on relatively simple synthetic data, 
for which we can precisely control the problem and the associated Bayes risk. 
Such an experimental setup serves the purpose of illustrating main insights 
revealed by our theoretical analyses. 

6.1. Experimental setup. In order to fully control the data generation 
mechanism, we shall use simple one-dimensional examples. A similar exper- 
imental setup was also used in [23] to study various theoretical aspects of 
voting classification methods.

Fig. 1. Target conditional probability for d = 2.



Our goal is to predict $Y \in \{\pm 1\}$ based on $X \in [0, 1]$. Throughout the
experiments, $X$ is uniformly distributed in $[0, 1]$. We consider the target
conditional probability of the form $P(Y = 1|X) = 2\{dX\} I(\{dX\} \le 0.5) +
2(1 - \{dX\}) I(\{dX\} > 0.5)$, where $d \ge 1$ is an integer which controls the
complexity of the target function, and $I$ denotes the set indicator function.
We have also used the notation $\{z\} = z - \lfloor z \rfloor$ to denote the decimal part of
a real number $z$, with the standard notation of $\lfloor z \rfloor$ for the integer part of $z$.
The Bayes error rate of our model is always 0.25.

Graphically, the target conditional probability contains d triangles. Fig- 
ure 1 plots such a target for d = 2. 
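
The target and its Bayes error are easy to reproduce. The Python sketch below implements the triangle-shaped conditional probability and approximates $E \min(\eta(X), 1 - \eta(X))$ by a midpoint rule; the function names and the grid size are assumptions made for this example.

```python
import math


def target_prob(x, d):
    """P(Y = 1 | X = x): d triangles of height 1 on [0, 1]."""
    frac = d * x - math.floor(d * x)  # decimal part {dx}
    return 2 * frac if frac <= 0.5 else 2 * (1 - frac)


def bayes_error(d, n_grid=20_000):
    """Midpoint-rule estimate of E[min(eta, 1 - eta)] for X uniform on [0, 1]."""
    total = 0.0
    for i in range(n_grid):
        eta = target_prob((i + 0.5) / n_grid, d)
        total += min(eta, 1 - eta)
    return total / n_grid


for d in (1, 2, 3):
    print(f"d={d}  Bayes error ~ {bayes_error(d):.4f}")
```

Since $\eta(X)$ sweeps $[0, 1]$ uniformly on every half-triangle, the estimate should not depend on $d$.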

We use one-dimensional stumps of the form $I([0, a])$ as our basis functions,
where $a$ is a parameter in $[0, 1]$. They form a complete basis since each
interval indicator function $I((a, b])$ can be expressed as $I([0, b]) - I([0, a])$.

There have been a number of experimental studies on the impact of using 
different convex loss functions (e.g., see [14, 27, 28, 39]). Although our the- 
oretical analysis applies to general loss functions, it is not refined enough to 
suggest that any one particular loss is better than another. For this reason, 
our experimental study will not include a comprehensive comparison of dif- 
ferent loss functions. This task is better left to dedicated empirical studies 
(such as some of those mentioned above). 

We will only focus on consequences of our analysis which have not been 
well studied empirically. These include various issues related to early stop- 
ping and their impact on the performance of boosting. For this purpose, 
throughout the experiments we shall only use the least-squares loss func- 
tion. In fact, it is known that this loss function works quite well for many 
classification problems (see, e.g., [11, 27]) and has been widely applied to 
many pattern-recognition applications. Its simplicity also makes it
attractive.

Fig. 2. Graphs of boosting estimators after k = 32 and 1024 iterations.

For the least-squares loss, the target function which the boosting
procedure tries to estimate is $f^*(x) = 2P(Y = 1|X = x) - 1$. In our experiments,
unless otherwise noted, we use boosting with restricted step-size, where at
each iteration we limit the step-size to be no larger than $h_i = (i + 1)^{-2/3}$.
This choice satisfies our numerical convergence requirement, where we need
the conditions $\sum_i h_i = \infty$ and $\sum_i h_i^2 < \infty$. Therefore it also satisfies the
consistency requirement in Theorem 3.1.
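
A minimal Python sketch of this setup follows. It is an illustration under stated assumptions, not the code used for the experiments: the search region is taken to be $\Lambda_i = [-h_i, h_i]$ with the assumed cap $h_i = (i+1)^{-2/3}$, thresholds are restricted to the training points (assumed distinct), and all function names are hypothetical. For a fixed stump the least-squares objective is quadratic in the step, so clipping the unconstrained minimizer to $[-h_i, h_i]$ gives the restricted optimum.

```python
import random


def target_prob(x, d):
    frac = (d * x) % 1.0
    return 2 * frac if frac <= 0.5 else 2 * (1 - frac)


def sample(m, d, rng):
    xs = [rng.random() for _ in range(m)]
    ys = [1.0 if rng.random() < target_prob(x, d) else -1.0 for x in xs]
    return xs, ys


def fit_boosting(xs, ys, n_iter, step_cap=lambda i: (i + 1) ** (-2.0 / 3.0)):
    """Greedy least-squares boosting on stumps I([0, a]) with |alpha_i| <= h_i."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    x_sorted = [xs[i] for i in order]
    resid = [ys[i] for i in order]  # residual y - f at sorted points, f_0 = 0
    model, total_step = [], 0.0
    for it in range(n_iter):
        h = step_cap(it)
        best = (0.0, None, 0.0)  # (change in squared error, threshold, alpha)
        prefix = 0.0
        for n_left, (a, r) in enumerate(zip(x_sorted, resid), start=1):
            prefix += r  # sum of residuals over points with x <= a
            alpha = max(-h, min(h, prefix / n_left))  # clipped 1-D minimizer
            change = alpha * alpha * n_left - 2.0 * alpha * prefix
            if change < best[0]:
                best = (change, a, alpha)
        _, a, alpha = best
        if a is None:  # no stump decreases the empirical loss
            break
        for j, x in enumerate(x_sorted):
            if x <= a:
                resid[j] -= alpha
        model.append((a, alpha))
        total_step += abs(alpha)
    return model, total_step


def predict(model, x):
    return sum(alpha for a, alpha in model if x <= a)


def train_loss(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
xs, ys = sample(100, 2, rng)
for k in (32, 1024):
    model, total = fit_boosting(xs, ys, k)
    print(f"k={k:<5} total step-size={total:.3f}  "
          f"training loss={train_loss(model, xs, ys):.4f}")
```

The returned total step-size is the sum of the absolute steps actually taken, which is at most $\sum_i h_i$ over the iterations run.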



6.2. Early stopping and overfitting. Although it is known that boosting 
forever can overfit (e.g., see [16, 19]), it is natural to begin our experiments 
by graphically showing the effect of early-stopping on the predictive perfor- 
mance of boosting. 

We shall use the target conditional probability described earlier with
complexity d = 2, and training sample-size of 100. Figure 2 plots the graphs of
estimators obtained after k = 32 and 1024 boosting iterations. The dotted lines
on the background show the true target function $f^*(x) = 2P(Y = 1|X = x) - 1$.
We can see that after 32 iterations, the boosting estimator, although not
perfect, roughly has the same shape as that of the true target function.
However, after 1024 iterations, the graph appears quite random, implying
that the boosting estimator starts to overfit the data.

Fig. 3. Predictive performance of boosting as a function of boosting iterations.

Figure 3 shows the predictive performance of boosting versus the number 
of iterations. The need for early stopping is quite apparent in this example. 
The excessive classification error quantity is defined as the true classification 
error of the estimator minus the Bayes error (which is 0.25 in our case). 
Similarly, the excessive convex loss quantity is defined as the true least- 
squares loss of the estimator minus the optimal least-squares loss of the 
target function /*(x). Both excessive classification error and convex loss are 
evaluated through numerical integration for a given decision rule. Moreover, 
as we can see from Figure 4, the training error continues to decrease as the 
number of boosting iterations increases, which eventually leads to overfitting 
of the training data. 

6.3. Early stopping and total step-size. Since our theoretical analysis fa- 
vors restricted step-size, a relevant question is what step-size we should 
choose. We are not the first authors to look into this issue. For example, 
Friedman and his co-authors suggested using small steps [14, 15]. In fact, 
they argued that the smaller the step-size, the better. They performed a 
number of empirical studies to support this claim. Therefore we shall not 
reinvestigate this issue here. Instead, we focus on a closely related impli- 
cation of our analysis, which will be useful for the purpose of reporting 
experimental results in later sections. 

Let $\bar\alpha_i$ be the step-size taken by the boosting algorithm at the $i$th
iteration. Our analysis characterizes the convergence behavior of boosting after
the $k$th step, not by the number of iterations $k$ itself, but rather by the
quantity $s_k = \sum_{i \le k} h_i$ in (10), as long as $\bar\alpha_i \le h_i \in \Lambda_i$. Although our
theorems are stated with the quantity $\sum_{i \le k} h_i$, instead of $\sum_{i \le k} \bar\alpha_i$, it does suggest
that in order to compare the behavior of boosting under different
configurations, it is more natural to use the quantity $\sum_{i \le k} \bar\alpha_i$ (which we shall call
total step-size throughout later experiments) as a measure of stopping point
rather than the actual number of boosting iterations. This concept of total
step-size also appeared in [18, 17].

Fig. 5. Predictive performance of boosting as a function of total step-size.

Figure 5 shows the predictive performance of boosting versus the total
step-size. We use 100 training examples, with the target conditional
probability of complexity d = 3. The unrestricted step-size method uses exact
optimization. Note that for least-squares loss, as explained in Section A.3, the
resulting step-sizes will still satisfy our consistency condition
$\sum_{i \le k} \bar\alpha_i^2 < \infty$. The restricted step-size scheme with step-size $\le h$ employs
a constant step-size restriction of $\bar\alpha_i \le h$. This experiment shows that the
behavior of these different boosting methods is quite similar when we measure
the performance not by the number of boosting iterations, but instead by the
total step-size. This observation justifies our theoretical analysis, which uses
quantities closely related to the total step-size to characterize the convergence
behavior of boosting methods. Based on this result, in the next few
experiments we shall use the total step-size (instead of the number of boosting
iterations) to compare boosting methods under different configurations.

6.4. The effect of sample-size on early stopping. An interesting issue for 
boosting with early stopping is how its predictive behavior changes when 
the number of samples increases. Although our analysis does not offer a 
quantitative characterization, it implies that we should stop later (and the 
allowable stopping range becomes wider) when sample size increases. This
essentially suggests that the optimal stopping point in the boosting
predictive performance curve will increase as the sample size increases, and the
curve itself becomes flatter. It follows that when the sample size is relatively
large, we should run boosting algorithms for a longer time, and it is less
necessary to do aggressive early stopping.

Fig. 6. Predictive performance of boosting at different sample sizes.

The above qualitative characterization of the boosting predictive curve 
also has important practical consequences. We believe this may be one reason 
why in many practical problems it is very difficult for boosting to overfit, and 
practitioners often observe that the performance of boosting keeps improving 
as the number of boosting iterations increases. 

Figure 6 shows the effect of sample size on the behavior of the boosting 
method. Since our theoretical analysis applies directly to the convergence 
of the convex loss (the convergence of classification error follows implicitly 
as a consequence of convex loss convergence), the phenomenon described 
above is more apparent for excessive convex loss curves. The effect on classi- 
fication error is less obvious, which suggests there is a discrepancy between 
classification error performance and convex loss minimization performance. 



6.5. Early stopping and consistency. In this experiment we demonstrate 
that as sample size increases, boosting with early stopping leads to a con- 
sistent estimator with its error rate approaching the optimal Bayes error. 
Clearly, it is not possible to prove consistency experimentally, which requires 
running a sample size of 00. We can only use a finite number of samples to 
demonstrate a clear trend that the predictive performance of boosting with 
early stopping converges to the Bayes error when the sample size increases. 
Another main focus of this experiment is to compare the performance of 
different early stopping strategies. 
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Fig. 7. Consistency and early stopping. 



Theoretical results in this paper suggest that for least squares loss, we 
can achieve consistency as long as we stop at total step-size approximately
$m^p$ with $p < 1/4$, where $m$ is the sample size. We call such an early
stopping strategy the p-strategy. Since our theoretical estimate is conservative,
we examine the p-strategy both for p = 1/6 and for p = 1/4. Instead of 
the theoretically motivated (and suboptimal) p-strategy, in practice one can 
use cross validation to determine the stopping point. We use a sample size 
of one-third the training data to estimate the optimal stopping total step- 
size which minimizes the classification error on the validation set, and then 
use the training data to compute a boosting estimator which stops at this 
cross-validation-determined total step-size. This strategy is referred to as 
the cross validation strategy. Figure 7 compares the three early stopping 
strategies mentioned above. It may not be very surprising to see that the 
cross-validation-based method is more reliable. The p-strategies, although 
they perform less well, also demonstrate a trend of convergence to consis- 
tency. We have also noticed that the cross validation scheme stops later than 
the p-strategies, implying that our theoretical results impose more restrictive 
conditions than necessary. 
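
The two kinds of stopping rule compared here can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind Figure 7: the function names `p_strategy_budget` and `cv_stopping_budget` are hypothetical, the p-strategy budget is taken to be the sample size raised to the power p, and the validation errors are assumed to have been recorded along the boosting path together with the cumulative step-size.

```python
def p_strategy_budget(m, p):
    """Total step-size budget m**p for a sample of size m (p-strategy)."""
    if m <= 0:
        raise ValueError("sample size must be positive")
    return m ** p


def cv_stopping_budget(total_step_sizes, validation_errors):
    """Return the total step-size at which the validation error is smallest.

    total_step_sizes[i] is the cumulative step-size after iteration i and
    validation_errors[i] is the held-out classification error at that point.
    Ties are broken in favor of the earliest (smallest) total step-size.
    """
    if len(total_step_sizes) != len(validation_errors) or not validation_errors:
        raise ValueError("need two non-empty sequences of equal length")
    best = min(range(len(validation_errors)), key=lambda i: validation_errors[i])
    return total_step_sizes[best]
```

The p-strategy depends only on the sample size, whereas the validation-based rule adapts to the observed error curve, which is one way to read the later stopping points reported for cross validation.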

It is also interesting to see how well cross validation finds the optimal 
stopping point. In Figure 8 we compare the cross validation strategy with 
two oracle strategies which are not implementable: one selects the optimal 
stopping point which minimizes the true classification error (which we refer 
to as optimal error), and the other selects the optimal stopping point which 
minimizes the true convex loss (which we refer to as optimal convex risk). 
These two methods can be regarded as ideal theoretical stopping points 
for boosting methods. The experiment shows that cross validation performs 
quite well at large sample sizes. 

In the log coordinate space, the convergence curve of boosting with the 
cross validation stopping criterion is approximately a straight line, which
implies that the excess errors decrease as a power of the sample size. By
extrapolating this finding, it is reasonable for us to believe that boosting
with early stopping converges to the Bayes error in the limit, which is
consistent with the theory. The two p-strategy stopping rules, even though
they show a much slower linear convergence trend, also appear to lead to consistency.
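
A straight line in log coordinates corresponds to a power law, and its slope can be estimated by ordinary least squares on the logarithms. The sketch below is a hypothetical helper (`fit_power_law` is not from the paper), and the numbers used with it would have to come from an experiment such as the one reported here.

```python
import math


def fit_power_law(sample_sizes, excess_errors):
    """Least squares fit of log(error) = intercept + slope * log(m).

    Returns (slope, intercept). A slope of -r corresponds to an excess
    error that decays like m**(-r). All inputs must be positive.
    """
    xs = [math.log(m) for m in sample_sizes]
    ys = [math.log(e) for e in excess_errors]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean
```

A fitted slope that stays clearly negative across the observed range supports, but cannot prove, convergence to the Bayes error.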



6.6. The effect of target function complexity on early stopping. Although 
we know that boosting with an appropriate early stopping strategy leads to 
a consistent estimator in the large sample limit, the rate of convergence 
depends on the complexity of the target function (see Section 5.3). In our 
analysis the complexity can be measured by the 1-norm of the target function.
For target functions considered here, it is not very difficult to show
that in order to approximate to an accuracy within $\epsilon$, it is only necessary
to use a combination of our decision stumps with the 1-norm $Cd/\epsilon$. In this
formula $C$ is a constant and $d$ is the complexity of the target function.

Our analysis suggests that the convergence behavior of boosting with 
early stopping depends on how easy it is to approximate the target function 
using a combination of basis functions with small 1-norm. A target with
$d = u$ is $u$ times as difficult to approximate as a target with $d = 1$. Therefore
the optimal stopping point, measured by the total step-size, should accordingly
increase as $d$ increases. Moreover, the predictive performance becomes
worse. Figure 9 illustrates this phenomenon with $d = 1, 3, 5$ at the sample
size of 300. Notice again that since our analysis applies to the convex risk,
this phenomenon is much more apparent for the excess convex loss performance
than for the excess classification error performance. Clearly this again
shows that although by minimizing a convex loss we indirectly minimize the
classification error, these two quantities do not behave identically.



7. Conclusion. In this paper we have studied a general version of the
boosting procedure given in Algorithm 2.1. The numerical convergence be- 
havior of this algorithm has been studied using the so-called averaging tech- 
nique, which was previously used to analyze greedy algorithms for optimiza- 
tion problems defined in the convex hull of a set of basis functions. We have 
derived an estimate of the numerical convergence speed and established con- 
ditions that ensure the convergence of Algorithm 2.1. Our results generalize 
those in previous studies, such as the matching pursuit analysis in [29] and 
the convergence analysis of AdaBoost by Breiman [9]. 

Furthermore, we have studied the learning complexity of boosting algo- 
rithms based on the Rademacher complexity of the basis functions. Together 
with the numerical convergence analysis, we have established a general early 
stopping criterion for greedy boosting procedures for various loss functions 
that guarantees the consistency of the obtained estimator in the large sam- 
ple limit. For specific choices of step-sizes and sample-independent stopping 
criteria, we have also been able to establish bounds on the statistical rate of 
convergence. We would like to mention that the learning complexity analysis 
given in this paper is rather crude. Consequently, the required conditions in 
our consistency strategy may be more restrictive than one actually needs. 

A number of experiments were presented to study various aspects of boost- 
ing with early stopping. We specifically focused on issues that have not been 
covered by previous studies. These experiments show that various quanti- 
ties and concepts revealed by our theoretical analysis lead to observable 
consequences. This suggests that our theory can lead to useful insights into 
practical applications of boosting algorithms. 

APPENDIX 

A.1. Loss function examples. We list commonly used loss functions that
satisfy Assumption 3.1. They show that for a typical boosting loss function
$\phi$, there exists a constant $M$ such that $\sup_a M(a) \le M$. All loss functions
considered are convex.

A.1.1. Logistic regression. This is a traditional loss function used in
statistics, which is given by (in natural log form here)

\[ \phi(f, y) = \ln(1 + \exp(-fy)), \qquad \psi(u) = u. \]

We assume that the basis functions satisfy the condition

\[ \sup_{g \in S,\, x} |g(x)| \le 1, \qquad y = \pm 1. \]

It can be verified that $A(f)$ is convex differentiable. We also have

\[ A''_{f,g}(0) = E_{X,Y} \frac{g(X)^2 Y^2}{(1 + \exp(f(X)Y))(1 + \exp(-f(X)Y))} \le \frac{1}{4}. \]
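
The constant 1/4 comes from the fact that $(1 + e^t)(1 + e^{-t}) = 2 + e^t + e^{-t} \ge 4$, with equality at $t = 0$. A small numerical check on a finite sample is sketched below; the helper names are hypothetical, and the empirical average stands in for the expectation over $(X, Y)$.

```python
import math


def logistic_curvature(margin):
    """Value of 1 / ((1 + e^t)(1 + e^-t)) at t = margin = f(x) * y.

    Intended for moderate margins; math.exp overflows for |t| above ~700.
    """
    return 1.0 / ((1.0 + math.exp(margin)) * (1.0 + math.exp(-margin)))


def logistic_second_derivative(margins, g_values):
    """Empirical second derivative at h = 0 for the logistic loss.

    margins[i] = f(x_i) * y_i and g_values[i] = g(x_i). Labels are in
    {-1, +1}, so the factor y_i**2 equals 1 and is omitted.
    """
    n = len(margins)
    return sum(g * g * logistic_curvature(t) for t, g in zip(margins, g_values)) / n
```

With $|g| \le 1$, every term in the average is at most 1/4, so the empirical quantity obeys the same bound as the population one.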



A.1.2. Exponential loss. This loss function is used in the AdaBoost algorithm,
which is the original boosting procedure for classification problems.
It is given by

\[ \phi(f, y) = \exp(-fy), \qquad \psi(u) = \ln u. \]

Again we assume that the basis functions satisfy the condition

\[ \sup_{g \in S,\, x} |g(x)| \le 1, \qquad y = \pm 1. \]

In this case it is also not difficult to verify that $A(f)$ is convex differentiable.
Hence we also have

\[ A''_{f,g}(0) = \frac{E_{X,Y}\, g(X)^2 Y^2 \exp(-f(X)Y)}{E_{X,Y} \exp(-f(X)Y)} - \frac{[E_{X,Y}\, g(X) Y \exp(-f(X)Y)]^2}{[E_{X,Y} \exp(-f(X)Y)]^2} \le 1. \]
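
For $A(f) = \ln E_{X,Y} \exp(-f(X)Y)$, the second derivative at $h = 0$ is the variance of $g(X)Y$ under weights proportional to $\exp(-f(X)Y)$, which explains why it lies between 0 and 1 when $|g| \le 1$. The following sketch evaluates it on a finite sample; the function name is hypothetical and the sample average replaces the expectation.

```python
import math


def exp_loss_second_derivative(margins, gy_values):
    """Empirical second derivative at h = 0 for A(f) = ln E exp(-f(X)Y).

    margins[i] = f(x_i) * y_i and gy_values[i] = g(x_i) * y_i.
    The result is the variance of g(X)Y under weights proportional to
    exp(-margin), so it lies in [0, 1] whenever |g| <= 1.
    """
    weights = [math.exp(-t) for t in margins]
    total = sum(weights)
    mean = sum(w * u for w, u in zip(weights, gy_values)) / total
    second_moment = sum(w * u * u for w, u in zip(weights, gy_values)) / total
    return second_moment - mean * mean
```

The upper bound 1 is attained, for example, when the weights are equal and $g(X)Y$ takes the values $+1$ and $-1$ equally often.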



A.1.3. Least squares. The least squares formulation has been widely
studied in regression, but can also be applied to classification problems [10, 11, 14, 30].
A greedy boosting-like procedure for least squares was first proposed in the
signal processing community, where it was called matching pursuit [29]. The
loss function is given by

\[ \phi(f, y) = \tfrac{1}{2}(f - y)^2, \qquad \psi(u) = u. \]

We impose the following weaker condition on the basis functions:

\[ \sup_{g \in S} E_X g(X)^2 \le 1, \qquad E_Y Y^2 < \infty. \]

It is clear that $A(f)$ is convex differentiable, and the second derivative is
bounded as

\[ A''_{f,g}(0) = E_X g(X)^2 \le 1. \]

A.1.4. Modified least squares. For classification problems we may consider
the following modified version of the least squares loss, which has a
better approximation property [38]:

\[ \phi(f, y) = \tfrac{1}{2} \max(1 - fy, 0)^2, \qquad \psi(u) = u. \]

Since this loss is for classification problems, we impose the condition

\[ \sup_{g \in S} E_X g(X)^2 \le 1, \qquad y = \pm 1. \]

It is clear that $A(f)$ is convex differentiable, and we have the following bound
for the second derivative:

\[ A''_{f,g}(0) \le E_X g(X)^2 \le 1. \]

A.1.5. p-norm boosting. The p-norm loss can be interesting both for regression
and for classification. In this paper we will only consider the case with
$p \ge 2$,

\[ \phi(f, y) = |f - y|^p, \qquad \psi(u) = \frac{1}{2(p-1)} u^{2/p}. \]

We impose the condition

\[ \sup_{g \in S} E_X |g(X)|^p \le 1, \qquad E_Y |Y|^p < \infty. \]

Now let $u = E_{X,Y} |f(X) + hg(X) - Y|^p$; we have

\[ A'_{f,g}(h) = \frac{1}{p-1} u^{(2-p)/p} E_{X,Y}\, g(X)\, \mathrm{sign}(f(X) + hg(X) - Y)\, |f(X) + hg(X) - Y|^{p-1}. \]

Therefore the second derivative can be bounded as

\begin{align*}
A''_{f,g}(h) &= u^{(2-p)/p} E_{X,Y}\, g(X)^2 |f(X) + hg(X) - Y|^{p-2} \\
&\quad - \frac{p-2}{p-1} u^{(2-2p)/p} \left[ E_{X,Y}\, g(X)\, \mathrm{sign}(f(X) + hg(X) - Y)\, |f(X) + hg(X) - Y|^{p-1} \right]^2 \\
&\le u^{(2-p)/p} E_{X,Y}\, g(X)^2 |f(X) + hg(X) - Y|^{p-2} \\
&\le u^{(2-p)/p} E_{X,Y}^{2/p} |g(X)|^p \, E_{X,Y}^{(p-2)/p} |f(X) + hg(X) - Y|^p \\
&= E_{X,Y}^{2/p} |g(X)|^p \le 1,
\end{align*}

where the second inequality follows from Hölder's inequality with the duality
pair $(p/2, p/(p-2))$.

Remark A.1. Similar to the least squares case, we can define the modified
p-norm loss for classification problems. Although the case $p \in (1, 2)$ can
be handled by the proof techniques used in this paper, it requires a modified 
analysis since in this case the corresponding loss function is not second-order 
differentiable at zero. See related discussions in [37]. Note that the hinge loss 
used in support vector machines cannot be handled directly with our cur- 
rent technique since its first-order derivative is discontinuous. However, one 
may approximate the hinge loss with a continuously differentiable function, 
which can then be analyzed. 

A.2. Numerical convergence proofs. This section contains two proofs for
the numerical convergence analysis section (Section 4.1).

Proof of the one-step analysis or Lemma 4.1. Given an arbitrary
fixed reference function $\bar f \in \mathrm{span}(S)$ with the representation

\[ \bar f = \sum_j \bar w^j \bar f_j, \qquad \bar f_j \in S, \tag{22} \]

we would like to compare $A(f_k)$ to $A(\bar f)$. Since $\bar f$ is arbitrary, we use such a
comparison to obtain a bound on the numerical convergence rate.

Given any finite subset $S' \subset S$ such that $S' \supset \{\bar f_j\}$, we can represent $\bar f$
minimally as

\[ \bar f = \sum_{g \in S'} \bar w^g_{S'}\, g, \]

where $\bar w^g_{S'} = \bar w^j$ when $g = \bar f_j$ for some $j$, and $\bar w^g_{S'} = 0$ when $g \notin \{\bar f_j\}$. A
quantity that will appear in our analysis is $\|\bar w_{S'}\|_1 = \sum_{g \in S'} |\bar w^g_{S'}|$. Since
$\|\bar w_{S'}\|_1 = \|\bar w\|_1$, without any confusion, we will still denote $\bar w_{S'}$ by $\bar w$ with
the convention that $\bar w^g = 0$ for all $g \notin \{\bar f_j\}$.

Given this reference function $\bar f$, let us consider a representation of $f_k$ as a
linear combination of a finite number of functions $S_k \subset S$, where $S_k \supset \{\bar f_j\}$
is to be chosen later. That is, with $g$ indexing an arbitrary function in $S_k$,
we expand $f_k$ in terms of $f_k^g$'s which are members of $S_k$ with coefficients $\beta_k^g$:

\[ f_k = \sum_{g \in S_k} \beta_k^g f_k^g. \tag{23} \]

With this representation, we define

\[ \Delta W_k = \|\bar w - \beta_k\|_1 = \sum_{g \in S_k} |\bar w^g - \beta_k^g|. \]

Recall that in the statement of the lemma, the convergence bounds are
in terms of $\|\bar w\|_1$ and a sequence of nondecreasing numbers $s_k$, which satisfy
the condition

\[ s_k = \|f_0\|_1 + \sum_{i=0}^{k-1} h_i, \qquad |\bar\alpha_k| \le h_k \in \Lambda_k, \]

where $h_k$ can be any real number that satisfies the above inequality, which
may or may not depend on the actual step-size $\bar\alpha_k$ computed by the boosting
algorithm.

Using the definition of 1-norm for $\bar f$ and since $f_0 \in \mathrm{span}(S)$, it is clear
that for all $\epsilon > 0$ we can choose a finite subset $S_k \subset S$, vector $\beta_k$ and vector
$\bar w$ such that

\[ \|\beta_k\|_1 = \sum_{g \in S_k} |\beta_k^g| \le s_k + \epsilon/2, \qquad \|\bar w\|_1 \le \|\bar f\|_1 + \epsilon/2. \]

It follows that with appropriate representation, the following inequality
holds for all $\epsilon > 0$:

\[ \Delta W_k \le s_k + \|\bar f\|_1 + \epsilon. \tag{24} \]

We now proceed to show that even in the worst case, the value $A(f_{k+1}) - A(\bar f)$
decreases from $A(f_k) - A(\bar f)$ by a reasonable quantity.

The basic idea is to upper bound the minimum of a set of numbers by an
appropriately chosen weighted average of these numbers. This proof technique,
which we shall call the "averaging method," was used in [1, 20, 25, 37]
for analysis of greedy-type algorithms.
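
The elementary inequality behind the averaging method is that the minimum of a finite set of numbers never exceeds any weighted average of them with nonnegative weights. A minimal sketch, with hypothetical helper names, is:

```python
def weighted_average(values, weights):
    """Weighted average with nonnegative weights that are not all zero."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative with positive sum")
    return sum(w * v for w, v in zip(weights, values)) / total


def min_is_bounded_by_average(values, weights, tol=1e-12):
    """Check that min(values) <= weighted average (up to rounding tolerance)."""
    return min(values) <= weighted_average(values, weights) + tol
```

In the proof, the numbers being averaged are the candidate losses obtained by moving along each basis function, and the weights are the coefficient gaps between the current iterate and the reference function.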

For $h_k$ that satisfies (10), the symmetry of $\Lambda_k$ implies $h_k\, \mathrm{sign}(\bar w^g - \beta_k^g) \in \Lambda_k$.
Therefore the approximate minimization step (3) implies that for all
$g \in S_k$, we have

\[ A(f_{k+1}) \le A(f_k + h_k s^g g) + \epsilon_k, \qquad s^g = \mathrm{sign}(\bar w^g - \beta_k^g). \]

Now multiply the above inequality by $|\bar w^g - \beta_k^g|$ and sum over $g \in S_k$; we
obtain

\[ \Delta W_k (A(f_{k+1}) - \epsilon_k) \le \sum_{g \in S_k} |\beta_k^g - \bar w^g|\, A(f_k + h_k s^g g) =: B(h_k). \tag{25} \]

We only need to upper bound $B(h_k)$, which in turn gives an upper bound
on $A(f_{k+1})$.

We recall a simple but important property of a convex function that
follows directly from the definition of convexity of $A(f)$ as a function of
$f$: for all $f_1, f_2$,

\[ A(f_2) \ge A(f_1) + \nabla A(f_1)^T (f_2 - f_1). \tag{26} \]

If $A(f_k) - A(\bar f) < 0$, then $\Delta A(f_k) = 0$. From $0 \in \Lambda_k$ and (3), we obtain

\[ A(f_{k+1}) - A(\bar f) \le A(f_k) - A(\bar f) + \epsilon_k \le \bar\epsilon_k, \]

which implies (13). Hence the lemma holds in this case. Therefore in the
following, we assume that $A(f_k) - A(\bar f) \ge 0$.
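Using Taylor expansion, we can bound each term on the right-hand side
of (25) as

\[ A(f_k + h_k s^g g) \le A(f_k) + h_k s^g \nabla A(f_k)^T g + \frac{h_k^2}{2} \sup_{\xi \in [0,1]} A''_{f_k,g}(\xi h_k s^g). \]

Since Assumption 3.1 implies that

\[ \sup_{\xi \in [0,1]} A''_{f_k,g}(\xi h_k s^g) = \sup_{\xi \in [0,1]} A''_{f_k + \xi h_k s^g g,\, g}(0) \le M(\|f_k\|_1 + h_k), \]

we have

\[ A(f_k + h_k s^g g) \le A(f_k) + h_k s^g \nabla A(f_k)^T g + \frac{h_k^2}{2} M(\|f_k\|_1 + h_k). \]

Taking a weighted average, we have

\begin{align*}
B(h_k) &= \sum_{g \in S_k} |\beta_k^g - \bar w^g|\, A(f_k + h_k s^g g) \\
&\le \sum_{g \in S_k} |\beta_k^g - \bar w^g| \left[ A(f_k) + \nabla A(f_k)^T h_k s^g g + \frac{h_k^2}{2} M(\|f_k\|_1 + h_k) \right] \\
&= \Delta W_k A(f_k) + h_k \nabla A(f_k)^T (\bar f - f_k) + \frac{h_k^2}{2} \Delta W_k M(\|f_k\|_1 + h_k) \\
&\le \Delta W_k A(f_k) + h_k [A(\bar f) - A(f_k)] + \frac{h_k^2}{2} \Delta W_k M(\|f_k\|_1 + h_k).
\end{align*}

The last inequality follows from (26). Now using (25) and the bound $\|f_k\|_1 + h_k \le s_{k+1}$,
we obtain

\[ (A(f_{k+1}) - A(\bar f)) - \epsilon_k \le \left(1 - \frac{h_k}{\Delta W_k}\right) (A(f_k) - A(\bar f)) + \frac{h_k^2}{2} M(s_{k+1}). \]

Now replace $\Delta W_k$ by the right-hand side of (24) with $\epsilon \to 0$; we obtain the
lemma. □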

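Proof of the multistep analysis or Lemma 4.2. Note that for all
$a > 0$,

\begin{align*}
\prod_{\ell=j}^{k} \left(1 - \frac{h_\ell}{s_\ell + a}\right)
&= \exp\left( \sum_{\ell=j}^{k} \ln\left(1 - \frac{s_{\ell+1} - s_\ell}{s_\ell + a}\right) \right) \\
&\le \exp\left( \sum_{\ell=j}^{k} - \frac{s_{\ell+1} - s_\ell}{s_\ell + a} \right) \\
&\le \exp\left( - \int_{s_j}^{s_{k+1}} \frac{1}{v + a}\, dv \right) = \frac{s_j + a}{s_{k+1} + a}.
\end{align*}

By recursively applying (13) and using the above inequality, we obtain

\begin{align*}
\Delta A(f_{k+1}) &\le \prod_{\ell=0}^{k} \left(1 - \frac{h_\ell}{s_\ell + \|\bar f\|_1}\right) \Delta A(f_0) + \sum_{j=0}^{k} \prod_{\ell=j+1}^{k} \left(1 - \frac{h_\ell}{s_\ell + \|\bar f\|_1}\right) \bar\epsilon_j \\
&\le \frac{s_0 + \|\bar f\|_1}{s_{k+1} + \|\bar f\|_1} \Delta A(f_0) + \sum_{j=0}^{k} \frac{s_{j+1} + \|\bar f\|_1}{s_{k+1} + \|\bar f\|_1} \bar\epsilon_j.
\end{align*}

□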
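
The product bound used in this proof, namely that the product of the factors $1 - h_\ell/(s_\ell + a)$ over $\ell = j, \dots, k$ is at most $(s_j + a)/(s_{k+1} + a)$ when $s_{\ell+1} = s_\ell + h_\ell$, can be checked numerically. The sketch below is illustrative only: the function name is hypothetical, and it assumes each step satisfies $h_\ell < s_\ell + a$ so that every factor is positive.

```python
def product_and_bound(step_sizes, s0, a, j):
    """Compare prod_{l=j}^{k} (1 - h_l / (s_l + a)) with (s_j + a) / (s_{k+1} + a).

    step_sizes = [h_0, ..., h_k] are nonnegative, s_0 = s0 >= 0 and
    s_{l+1} = s_l + h_l. Assumes a > 0 and h_l < s_l + a for every l.
    Returns the pair (product, bound).
    """
    s = [s0]
    for h in step_sizes:
        s.append(s[-1] + h)
    product = 1.0
    for l in range(j, len(step_sizes)):
        product *= 1.0 - step_sizes[l] / (s[l] + a)
    bound = (s[j] + a) / (s[-1] + a)
    return product, bound
```

The bound shows that the influence of the initial error decays like the inverse of the total step-size, regardless of how the individual steps are distributed.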

A.3. Discussion of step-size. We have been deriving our results in the
case of restricted step-size in which the crucial small step-size condition is 
explicit. In this section we investigate the case of unrestricted step-size under 
exact minimization, for which we show that the small step-size condition is 
actually implicit if the boosting algorithm converges. The implication is that 
the consistency (and rate of convergence) results can be extended to such a 
case, although the analysis becomes more complicated. 

Let $\Lambda_k = R$ for all $k$, so that the size of $\bar\alpha_k$ in the boosting algorithm is
unrestricted. For simplicity, we will only consider the case that $\sup_a M(a)$ is
upper bounded by a constant $M$.

Interestingly enough, although the size of $\bar\alpha_k$ is not restricted in the boosting
algorithm itself, for certain formulations the inequality $\sum_j \bar\alpha_j^2 < \infty$ still
holds. Theorem 4.1 can then be applied to show the convergence of such
boosting procedures. For convenience, we will impose the following additional
assumption for the step-size $\bar\alpha_k$ in Algorithm 2.1:

\[ A(f_k + \bar\alpha_k \bar g_k) = \inf_{\alpha_k \in R} A(f_k + \alpha_k \bar g_k), \tag{27} \]

which means that given the selected basis function $\bar g_k$, the corresponding
$\bar\alpha_k$ is chosen to be the exact minimizer.

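Lemma A.1. Assume that $\bar\alpha_k$ satisfies (27). If there exists a positive
constant $c$ such that

\[ \inf_k \inf_{\xi \in (0,1)} A''_{(1-\xi) f_k + \xi f_{k+1},\, \bar g_k}(0) \ge c, \]

then

\[ \sum_{j=0}^{k} \bar\alpha_j^2 \le 2c^{-1} [A(f_0) - A(f_{k+1})]. \]

Proof. Since $\bar\alpha_k$ minimizes $A_{f_k, \bar g_k}(\alpha)$, we have $A'_{f_k, \bar g_k}(\bar\alpha_k) = 0$. Using Taylor
expansion, we obtain

\[ A_{f_k, \bar g_k}(0) = A_{f_k, \bar g_k}(\bar\alpha_k) + \tfrac{1}{2} A''_{f_k, \bar g_k}(\xi_k \bar\alpha_k)\, \bar\alpha_k^2, \]

where $\xi_k \in (0, 1)$. That is, $A(f_k) = A(f_{k+1}) + \tfrac{1}{2} A''_{f_k, \bar g_k}(\xi_k \bar\alpha_k)\, \bar\alpha_k^2$. By assumption,
we have $A''_{f_k, \bar g_k}(\xi_k \bar\alpha_k) \ge c$. It follows that, for all $j \ge 0$, $\bar\alpha_j^2 \le 2c^{-1} [A(f_j) - A(f_{j+1})]$.
We can obtain the lemma by summing from $j = 0$ to $k$. □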



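By combining Lemma A.1 and Corollary 4.1, we obtain:

Corollary A.1. Assume that $\sup_a M(a) < +\infty$ and $\epsilon_j$ in (3) satisfies
$\sum_{j=0}^{\infty} \epsilon_j < \infty$. Assume also that in Algorithm 2.1 we let $\Lambda_k = R$ and let $\bar\alpha_k$
satisfy (27). If

\[ \inf_k \inf_{\xi \in (0,1)} A''_{(1-\xi) f_k + \xi f_{k+1},\, \bar g_k}(0) > 0, \]

then

\[ \lim_{k \to \infty} A(f_k) = \inf_{f \in \mathrm{span}(S)} A(f). \]

Proof. If $\lim_{k \to \infty} A(f_k) = -\infty$, then the conclusion is automatically
true. Otherwise, Lemma A.1 implies that $\sum_{j=0}^{\infty} \bar\alpha_j^2 < \infty$. Now choose
$h_j = |\bar\alpha_j| + 1/(j + 1)$ in (10); we have $\sum_{j=0}^{\infty} h_j = \infty$, and $\sum_{j=0}^{\infty} h_j^2 < \infty$. The
convergence now follows from Corollary 4.1. □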



Least squares loss. The convergence of unrestricted step-size boosting
using least squares loss (matching pursuit) was studied in [29]. Since a scaling
of the basis function does not change the algorithm, without loss of generality
we can assume that $E_X g(X)^2 = 1$ for all $g \in S$ (assume $S$ does not contain
function 0). In this case it is easy to check that for all $g \in S$,

\[ A''_{f,g}(0) = E_X g(X)^2 = 1. \]

Therefore the conditions in Corollary A.1 are satisfied as long as $\sum_{j=0}^{\infty} \epsilon_j < \infty$.
This shows that the matching pursuit procedure converges, that is,

\[ \lim_{k \to \infty} A(f_k) = \inf_{f \in \mathrm{span}(S)} A(f). \]

We would like to point out that for matching pursuit, the inequality in
Lemma A.1 can be replaced by the equality

\[ \sum_{j=0}^{k} \bar\alpha_j^2 = 2 [A(f_0) - A(f_{k+1})], \]

which was referred to as "energy conservation" in [29], and was used there
to prove the convergence.
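
The energy conservation identity for matching pursuit, which states that the sum of the squared exact step-sizes equals twice the total decrease in the least squares objective, can be checked on a finite sample. The sketch below is a toy illustration, not the procedure of [29]: expectations are replaced by sample means, the loss is half the mean squared residual, the basis is a small hypothetical list of vectors rescaled to unit second moment, and the function names are made up for this example.

```python
def normalize(g):
    """Rescale g so that its empirical second moment mean(g**2) equals 1."""
    scale = (sum(v * v for v in g) / len(g)) ** 0.5
    return [v / scale for v in g]


def half_mse(fit, y):
    """Empirical least squares objective: 0.5 * mean((fit - y)**2)."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(fit, y)) / len(y)


def matching_pursuit(basis, y, n_steps):
    """Greedy least squares boosting with exact step-sizes, starting from 0.

    basis: list of basis functions given by their values on the sample,
           each scaled so that mean(g**2) == 1.
    Returns (steps, losses): the step-sizes taken and the objective values
    before the first step and after each step.
    """
    n = len(y)
    f = [0.0] * n
    steps, losses = [], [half_mse(f, y)]
    for _ in range(n_steps):
        residual = [b - a for a, b in zip(f, y)]
        # For a unit-norm g, the exact minimizer over the step is mean(g * residual).
        corr = [sum(gi * ri for gi, ri in zip(g, residual)) / n for g in basis]
        best = max(range(len(basis)), key=lambda i: abs(corr[i]))
        alpha = corr[best]
        f = [a + alpha * gi for a, gi in zip(f, basis[best])]
        steps.append(alpha)
        losses.append(half_mse(f, y))
    return steps, losses
```

Each exact step with a unit-norm basis function lowers the objective by exactly half the squared step-size, and summing this over the iterations gives the identity.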

Exponential loss. The convergence behavior of boosting with exponential
loss was previously studied by Breiman [9] for $\pm 1$-trees under the assumption
$\inf_x P(Y = 1|x) P(Y = -1|x) > 0$. Using exact computation, Breiman
obtained an equality similar to the matching pursuit energy conservation
equation. As part of the convergence analysis, the equality was used to show
$\sum_{j=0}^{\infty} \bar\alpha_j^2 < \infty$.

The following lemma shows that under a more general condition, the con- 
vergence of unrestricted boosting with exponential loss follows directly from 
Corollary A.l. This result extends that of [9], but the condition still con- 
strains the class of measures that generate the joint distribution of {X,Y). 



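Lemma A.2. Assume that

\[ \inf_{g \in S} E_X |g(X)| \sqrt{P(Y = 1|X) P(Y = -1|X)} > 0. \]

If $\bar\alpha_k$ satisfies (27), then $\inf_k \inf_{\xi \in (0,1)} A''_{(1-\xi) f_k + \xi f_{k+1},\, \bar g_k}(0) > 0$. Hence $\sum_j \bar\alpha_j^2 < \infty$.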

Proof. For notational simplicity, we let $q_{X,Y} = \exp(-f(X)Y)$. Recall
that the direct computation of $A''_{f,g}(0)$ in Section A.1.2 yields

\begin{align*}
[E_{X,Y}\, q_{X,Y}]^2 A''_{f,g}(0)
&= [E_{X,Y}\, g(X)^2 q_{X,Y}][E_{X,Y}\, q_{X,Y}] - [E_{X,Y}\, g(X) Y q_{X,Y}]^2 \\
&= [E_X g(X)^2 E_{Y|X} q_{X,Y}][E_X E_{Y|X} q_{X,Y}] - [E_X g(X) E_{Y|X} Y q_{X,Y}]^2 \\
&\ge [E_X g(X)^2 E_{Y|X} q_{X,Y}][E_X E_{Y|X} q_{X,Y}] - [E_X g(X)^2 |E_{Y|X} Y q_{X,Y}|][E_X |E_{Y|X} Y q_{X,Y}|] \\
&\ge [E_X g(X)^2 E_{Y|X} q_{X,Y}]\, E_X [E_{Y|X} q_{X,Y} - |E_{Y|X} Y q_{X,Y}|] \\
&\ge \left[ E_X |g(X)| \sqrt{E_{Y|X} q_{X,Y} \left( E_{Y|X} q_{X,Y} - |E_{Y|X} Y q_{X,Y}| \right)} \right]^2 \\
&\ge \left[ E_X |g(X)| \sqrt{2 P(Y = 1|X) P(Y = -1|X)} \right]^2.
\end{align*}

The first and the third inequalities follow from Cauchy-Schwarz, and the
last inequality used the fact that $(a + b)((a + b) - |a - b|) \ge 2ab$. Now observe
that $E_{X,Y}\, q_{X,Y} = \exp(A(f))$. The exact minimization (27) implies that
$A(f_k) \le A(f_0)$ for all $k \ge 0$. Therefore, using Jensen's inequality we know
that for all $\xi \in (0, 1)$, $A((1 - \xi) f_k + \xi f_{k+1}) \le A(f_0)$. This implies the desired
inequality,

\[ A''_{(1-\xi) f_k + \xi f_{k+1},\, \bar g_k}(0) \ge \exp(-2A(f_0)) \left[ E_X |\bar g_k(X)| \sqrt{2 P(Y = 1|X) P(Y = -1|X)} \right]^2. \]

□

Although unrestricted step-size boosting procedures can be successful in
certain cases, for general problems we are unable to prove convergence. In
such cases the crucial condition of $\sum_{j=0}^{\infty} \bar\alpha_j^2 < \infty$, as required in the proof of
Corollary A.1, can be violated. Although we do not have concrete examples
at this point, we believe boosting may fail to converge when this condition
is violated.

For example, for logistic regression we are unable to prove a result similar
to Lemma A.2. The difficulty is caused by the near-linear behavior of the
loss function toward negative infinity. This means that the second derivative
is so small that we may take an extremely large step-size when $\bar\alpha_j$ is exactly
minimized.

Intuitively, the difficulty associated with large $\bar\alpha_j$ is due to the potential
problem of large oscillation in that a greedy step may search for a suboptimal 
direction, which needs to be corrected later on. If a large step is taken 
toward the suboptimal direction, then many more additional steps have to 
be taken to correct the mistake. If the additional steps are also large, then 
we may overcorrect and go to some other suboptimal directions. In general 
it becomes difficult to keep track of the overall effect. 

A.4. The relationship of AdaBoost and $L_1$-margin maximization. Given
a real-valued classification function $p(x)$, we consider the following discrete
prediction rule:

\[ y = \begin{cases} 1, & \text{if } p(x) > 0, \\ -1, & \text{if } p(x) < 0. \end{cases} \tag{28} \]

Its classification error [for simplicity we ignore the point $p(x) = 0$, which is
assumed to occur rarely] is given by

\[ L_\gamma(p(x), y) = \begin{cases} 1, & \text{if } p(x) y \le \gamma, \\ 0, & \text{if } p(x) y > \gamma, \end{cases} \]

with $\gamma = 0$. In general, we may consider $\gamma \ge 0$ and the parameter $\gamma \ge 0$
is often referred to as margin, and we shall call the corresponding error
function $L_\gamma$ margin error.
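
The prediction rule and the margin error can be written down directly for a finite sample. In this sketch the function names are hypothetical, the tie $p(x) = 0$ (which the text ignores) is arbitrarily assigned to the class $+1$, and a point counts as a margin error when $p(x) y$ does not exceed the margin.

```python
def predict(p_value):
    """Discrete prediction rule: +1 if p(x) >= 0, otherwise -1.

    Assigning p(x) == 0 to +1 is an arbitrary tie-breaking assumption.
    """
    return 1 if p_value >= 0 else -1


def margin_error(p_values, labels, gamma=0.0):
    """Empirical margin error: fraction of points with p(x) * y <= gamma."""
    if gamma < 0:
        raise ValueError("margin must be nonnegative")
    mistakes = sum(1 for p, y in zip(p_values, labels) if p * y <= gamma)
    return mistakes / len(labels)
```

With `gamma=0.0` this reduces to the ordinary classification error (up to ties), and increasing `gamma` can only increase the error count.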

In [33] the authors proved that under appropriate assumptions on the
base learner, the expected margin error $L_\gamma$ with a positive margin $\gamma > 0$
also decreases exponentially. It follows that regularity assumptions of weak
learning for AdaBoost imply the following margin condition: there exists
$\gamma > 0$ such that $\inf_{f \in \mathrm{span}(S),\, \|f\|_1 = 1} L_\gamma(f, y) = 0$, which in turn implies the
inequality for all $s > 0$,

\[ \inf_{f \in \mathrm{span}(S),\, \|f\|_1 = 1} E_{X,Y} \exp(-s f(X) Y) \le \exp(-\gamma s). \tag{29} \]

We now show that under (29) the expected margin errors (with small margin)
from Algorithm 2.1 may decrease exponentially. A similar analysis was
given in [37]. However, the boosting procedure considered there was modified
so that the estimator always stays in the scaled convex hull of the basis
functions. This restriction is removed in the current analysis:

\[ f_0 = 0, \qquad \sup \Lambda_k \le h_k, \qquad \epsilon_k \le h_k^2/2. \]

Note that this implies that $\bar\epsilon_k \le h_k^2$ for all $k$.

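Now applying (15) with $\bar f = s f$ for any $s > 0$ and letting $f$ approach the
minimum in (29), we obtain (recall $\|f\|_1 = 1$)

\[ A(f_k) \le -s\gamma \frac{s_k}{s_k + s} + \sum_{j=0}^{k-1} \frac{s_{j+1} + s}{s_k + s} \bar\epsilon_j \le -s\gamma \frac{s_k}{s_k + s} + \sum_{j=0}^{k-1} h_j^2. \]

Now let $s \to \infty$; we have

\[ A(f_k) \le -\gamma s_k + \sum_{j=0}^{k-1} h_j^2. \]

Assume we pick a constant $h < \gamma$ and let $h_k = h$; then

\[ E_{X,Y} \exp(-f_k(X) Y) \le \exp(-kh(\gamma - h)), \tag{30} \]

which implies that the margin error decreases exponentially for all margins
less than $\gamma - h$. To see this, consider $\gamma' < \gamma - h$. Since $\|f_k\|_1 \le kh$, we have
from (30),

\begin{align*}
L_{\gamma'}(f_k(x)/\|f_k\|_1, y) &\le P(f_k(X) Y \le kh\gamma') \\
&\le E_{X,Y} \exp(-f_k(X) Y + kh\gamma') \le \exp(-kh(\gamma - h - \gamma')).
\end{align*}

Therefore

\[ \lim_{k \to \infty} L_{\gamma'}(f_k(x)/\|f_k\|_1, y) = 0. \]

This implies that as $h \to 0$, $f_k(x)/\|f_k\|_1$ achieves a margin that is within
$h$ of the maximum possible. Therefore, when $h \to 0$ and $k \to \infty$, $f_k(x)/\|f_k\|_1$
approaches a maximum margin separator.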
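
The margin behavior of exponential-loss boosting with a small fixed step-size can be observed on a toy problem. The sketch below is an assumption-laden illustration rather than the algorithm analyzed above in full generality: the basis functions are the three coordinate maps on the cube $\{-1, +1\}^3$, the label is the sign of the coordinate sum, each iteration exactly picks the best signed coordinate for a fixed step, and all function and variable names are hypothetical. For this particular data set the largest achievable $L_1$ margin is 1/3, attained by the average of the three coordinates.

```python
import math
from itertools import product


def exp_loss(coef, points, labels):
    """Empirical exponential loss: mean of exp(-y * f(x)) with f(x) = <coef, x>."""
    total = 0.0
    for x, y in zip(points, labels):
        score = sum(c * xi for c, xi in zip(coef, x))
        total += math.exp(-y * score)
    return total / len(points)


def small_step_boosting(points, labels, step, n_iter):
    """Greedy boosting with a fixed step-size on coordinate basis functions.

    Each iteration adds +step or -step to the single coefficient that gives
    the smallest empirical exponential loss. Returns the coefficient vector.
    """
    coef = [0.0] * len(points[0])
    for _ in range(n_iter):
        best = None
        for i in range(len(coef)):
            for sign in (1.0, -1.0):
                trial = list(coef)
                trial[i] += sign * step
                candidate = (exp_loss(trial, points, labels), i, sign)
                if best is None or candidate < best:
                    best = candidate
        coef[best[1]] += best[2] * step
    return coef


def normalized_margin(coef, points, labels):
    """Smallest value of y * f(x) / ||coef||_1 over the sample."""
    norm = sum(abs(c) for c in coef)
    return min(
        y * sum(c * xi for c, xi in zip(coef, x)) / norm
        for x, y in zip(points, labels)
    )


# Toy separable problem: label = sign of the coordinate sum on {-1, +1}^3.
POINTS = list(product((-1.0, 1.0), repeat=3))
LABELS = [1 if sum(x) > 0 else -1 for x in POINTS]
```

Running `small_step_boosting(POINTS, LABELS, 0.05, 1000)` and passing the result to `normalized_margin` gives a margin that cannot exceed 1/3 on this data; smaller steps and more iterations should bring it closer to that value, in line with the argument above.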

Note that in this particular case we allow a small step-size ($h < \gamma$), which
violates the condition $\sum_k h_k^2 < \infty$ imposed for the boosting algorithm to converge.
However, this condition that prevents large oscillation from occurring
is only a sufficient condition to guarantee convergence. For specific problems,
especially when $\inf_{f \in \mathrm{span}(S)} A(f) = -\infty$, it is still possible to achieve
convergence even if the condition is violated.